Abstract. We propose novel controller synthesis techniques for probabilistic systems 
modelled using stochastic two-player games: one player acts as a controller, the second 
represents its environment, and probability is used to capture uncertainty arising due to, 
for example, unreliable sensors or faulty system components. Our aim is to generate robust 
controllers that are resilient to unexpected system changes at runtime, and flexible enough 
to be adapted if additional constraints need to be imposed. We develop a permissive 
controller synthesis framework, which generates multi-strategies for the controller, offering a 
choice of control actions to take at each time step. We formalise the notion of permissivity 
using penalties, which are incurred each time a possible control action is disallowed by 
a multi-strategy. Permissive controller synthesis aims to generate a multi-strategy that 
minimises these penalties, whilst guaranteeing the satisfaction of a specified system property. 
We establish several key results about the optimality of multi-strategies and the complexity 
of synthesising them. Then, we develop methods to perform permissive controller synthesis 
using mixed integer linear programming and illustrate their effectiveness on a selection of 
case studies. 


1. Introduction 

Probabilistic model checking is used to automatically verify systems with stochastic behaviour. 
Systems are modelled as, for example, Markov chains, Markov decision processes, or stochastic 
games, and analysed algorithmically to verify quantitative properties specified in temporal 
logic. Applications include checking the safe operation of fault-prone systems (“the brakes fail 
to deploy with probability at most $10^{-6}$”) and establishing guarantees on the performance 
of, for example, randomised communication protocols (“the expected time to establish 
connectivity between two devices never exceeds 1.5 seconds”). 

A closely related problem is that of controller synthesis. This entails constructing a 
model of some entity that can be controlled (e.g., a robot, a vehicle or a machine) and its 
environment, formally specifying the desired behaviour of the system, and then generating, 
through an analysis of the model, a controller that will guarantee the required behaviour. In 
many applications of controller synthesis, a model of the system is inherently probabilistic. 
For example, a robot’s sensors and actuators may be unreliable, resulting in uncertainty 
when detecting and responding to its current state; or messages sent wirelessly to a vehicle 
may fail to be delivered with some probability. 

In such cases, the same techniques that underlie probabilistic model checking can be 
used for controller synthesis. For example, we can model the system as a Markov decision 
process (MDP), specify a property $\phi$ in a probabilistic temporal logic such as PCTL or LTL, 
and then apply probabilistic model checking. This yields an optimal strategy (policy) for the 
MDP, which instructs the controller as to which action should be taken in each state of the 
model in order to guarantee that $\phi$ will be satisfied. This approach has been successfully 
applied in a variety of application domains, to synthesise, for example: control strategies 
for robots [22], power management strategies for hardware [16], and efficient PIN guessing 
attacks against hardware security modules [29]. 

Another important dimension of the controller synthesis problem is the presence of 
uncontrollable or adversarial aspects of the environment. We can take account of this by 
phrasing the system model as a game between two players, one representing the controller 
and the other the environment. Examples of this approach include controller synthesis for 
surveillance cameras [21], autonomous vehicles nn or real-time systems fl). In our setting, 
we use (turn-based) stochastic two-player games, which can be seen as a generalisation of 
MDPs where decisions are made by two distinct players. Probabilistic model checking of 
such a game yields a strategy for the controller player which guarantees satisfaction of a 
property $\phi$, regardless of the actions of the environment player. 

In this paper, we tackle the problem of synthesising robust and flexible controllers, which 
are resilient to unexpected changes in the system at runtime. For example, one or more of 
the actions that the controller can choose at runtime might unexpectedly become unavailable, 
or additional constraints may be imposed on the system that make some actions preferable 
to others. One motivation for our work is its applicability to model-driven runtime control of 
adaptive systems [5], which uses probabilistic model checking in an online fashion to adapt 
or reconfigure a system at runtime in order to guarantee the satisfaction of certain formally 
specified performance or reliability requirements. 

We develop novel, permissive controller synthesis techniques for systems modelled as 
stochastic two-player games. Rather than generating strategies, which specify a single action 
to take at each time-step, we synthesise multi-strategies, which specify multiple possible 
actions. As in classical controller synthesis, generation of a multi-strategy is driven by a 
formally specified quantitative property: we focus on probabilistic reachability and expected 
total reward properties. The property must be guaranteed to hold, whichever of the specified 
actions are taken and regardless of the behaviour of the environment. Simultaneously, we 
aim to synthesise multi-strategies that are as permissive as possible, which we quantify by 
assigning penalties to actions. These are incurred when a multi-strategy disallows (does not 
make available) a given action. Actions can be assigned different penalty values to indicate 
the relative importance of allowing them. Permissive controller synthesis amounts to finding 
a multi-strategy whose total incurred penalty is minimal, or below some given threshold. 

We formalise the permissive controller synthesis problem and then establish several key 
theoretical results. In particular, we show that randomised multi-strategies are strictly more 
powerful than deterministic ones, and we prove that the permissive controller synthesis 
problem is NP-hard for either class. We also establish upper bounds, showing that the 
problem is in NP and PSPACE for the deterministic and randomised cases, respectively. 

Next, we propose practical methods for synthesising multi-strategies using mixed integer 
linear programming (MILP) [57]. We give an exact encoding for deterministic multi-strategies 
and an approximation scheme (with adaptable precision) for the randomised case. For the 
latter, we prove several additional results that allow us to reduce the search space of 
multi-strategies. The MILP solution process works incrementally, yielding increasingly 
permissive multi-strategies, and can thus be terminated early if required. This is well suited 
to scenarios where time is limited, such as online analysis for runtime control, as discussed 
above, or “anytime verification” [28]. Finally, we implement our techniques and evaluate 
their effectiveness on a range of case studies. 

This paper is an extended version of m, containing complete proofs, optimisations for 
MILP encodings and experiments comparing performance under two different MILP solvers. 

1.1. Related Work. Permissive strategies in non-stochastic games were first studied in [2] 
for parity objectives, but permissivity was defined solely by comparing enabled actions. 
Bouyer et al. [3] showed that optimally permissive memoryless strategies exist for reachability 
objectives and expected penalties, contrasting with our (stochastic) setting, where they may 
not. The work in [3] also studies penalties given as mean-payoff and discounted reward 
functions, and [4] extends the results to the setting of parity games. None of these works consider 
stochastic games or even randomised strategies, and they provide purely theoretical results. 
As in our work, Kumar and Garg [20] consider control of stochastic systems by dynamically 
disabling events; however, rather than stochastic games, their models are essentially Markov 
chains, which the possibility of selectively disabling branches turns into MDPs. [26] studies 
games where the aim of one opponent is to ensure properties of systems against an opponent 
who can modify the system on-the-fly by removing some transitions. 

Finally, although tackling a rather different problem (counterexample generation), [31] 
is related in that it also uses MILP to solve probabilistic verification problems. 

2. Preliminaries 

We denote by $Dist(X)$ the set of discrete probability distributions over a set $X$. A Dirac 
distribution is one that assigns probability 1 to some $s \in X$. The support of a distribution 
$d \in Dist(X)$ is defined as $supp(d) = \{x \in X \mid d(x) > 0\}$. 

2.1. Stochastic Games. In this paper, we use turn-based stochastic two-player games, 
which we often refer to simply as stochastic games. A stochastic game takes the form 
$G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$, where $S = S_\Diamond \cup S_\Box$ is a finite set of states, each associated with player 
$\Diamond$ or $\Box$, $\bar{s} \in S$ is an initial state, $A$ is a finite set of actions and $\delta : S \times A \to Dist(S)$ is a 
(partial) probabilistic transition function such that the distributions assigned by $\delta$ only select 
elements of $S$ with rational probabilities. 

An MDP is a stochastic game in which either $S_\Diamond$ or $S_\Box$ is empty. Each state $s$ of a 
stochastic game $G$ has a set of enabled actions, given by $A(s) = \{a \in A \mid \delta(s, a) \text{ is defined}\}$. 
The unique player $\circ \in \{\Diamond, \Box\}$ such that $s \in S_\circ$ picks an action $a \in A(s)$ to be taken in state 
$s$. Then, the next state is determined randomly according to the distribution $\delta(s, a)$, i.e., a 
transition to state $s'$ occurs with probability $\delta(s, a)(s')$. 
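
As a minimal sketch of the definition above, a turn-based stochastic game can be stored in a small Python structure. The class name `StochasticGame`, its field names and the two-state toy game are illustrative assumptions and do not come from the paper; exact rationals are used because the transition probabilities are required to be rational.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class StochasticGame:
    controller_states: set   # states of the controller player
    environment_states: set  # states of the environment player
    initial: str             # initial state
    delta: dict              # partial map (state, action) -> {successor: probability}

    def enabled(self, s):
        """A(s): the actions a for which delta(s, a) is defined."""
        return {a for (t, a) in self.delta if t == s}

    def is_mdp(self):
        """An MDP is a game in which one of the two state sets is empty."""
        return not self.controller_states or not self.environment_states

    def check(self):
        """Every delta(s, a) must be a probability distribution over known states."""
        states = self.controller_states | self.environment_states
        for (s, _), dist in self.delta.items():
            assert s in states and set(dist) <= states
            assert sum(dist.values()) == 1


# Hypothetical toy game (not the game of Fig. 1).
game = StochasticGame(
    controller_states={"s0"},
    environment_states={"s1"},
    initial="s0",
    delta={
        ("s0", "go"): {"s1": Fraction(3, 4), "s0": Fraction(1, 4)},
        ("s0", "stay"): {"s0": Fraction(1)},
        ("s1", "back"): {"s0": Fraction(1)},
    },
)
game.check()
```

Storing $\delta$ as a dictionary keyed by state-action pairs makes the partiality explicit: an action is enabled in a state exactly when the corresponding key is present.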

A path through $G$ is a (finite or infinite) sequence $\omega = s_0 a_0 s_1 a_1 \ldots$, where $s_i \in S$, 
$a_i \in A(s_i)$ and $\delta(s_i, a_i)(s_{i+1}) > 0$ for all $i$. We denote by $IPath_s$ the set of all infinite paths 
starting in state $s$. For a player $\circ \in \{\Diamond, \Box\}$, we denote by $FPath^\circ$ the set of all finite paths 
starting in any state and ending in a state from $S_\circ$. 

A strategy $\sigma : FPath^\circ \to Dist(A)$ for player $\circ$ of $G$ is a resolution of the choices of 
actions in each state from $S_\circ$ based on the execution so far, such that only enabled actions 
in a state are chosen with non-zero probability. In standard fashion m, a pair of strategies 
$\sigma$ and $\pi$ for $\Diamond$ and $\Box$ induces, for any state $s$, a probability measure $Pr_{G,s}^{\sigma,\pi}$ over $IPath_s$. A 
strategy $\sigma$ is deterministic if $\sigma(\omega)$ is a Dirac distribution for all $\omega$, and randomised if not. 
In this work, we focus purely on memoryless strategies, where $\sigma(\omega)$ depends only on the 
last state of $\omega$, in which case we define the strategy as a function $\sigma : S_\circ \to Dist(A)$. We 
write $\Sigma_G^\circ$ for the set of all (memoryless) player $\circ$ strategies in $G$. 


2.2. Properties and Rewards. In order to synthesise controllers, we need a formal description 
of their required properties. In this paper, we use two common classes of properties: 
probabilistic reachability and expected total reward, which we will express in an extended 
version of the temporal logic PCTL [IS] . 

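For probabilistic reachability, we write properties of the form $\phi = P_{\bowtie p}[F\, g]$, where 
$\bowtie \in \{\leq, \geq\}$, $p \in [0, 1]$ and $g \subseteq S$ is a set of target states, meaning that the probability 
of reaching a state in $g$ satisfies the bound $\bowtie p$. Formally, for a specific pair of strategies 
$\sigma \in \Sigma_G^\Diamond$, $\pi \in \Sigma_G^\Box$ for $G$, the probability of reaching $g$ under $\sigma$ and $\pi$ is 

$$Pr_{G,\bar{s}}^{\sigma,\pi}(F\, g) = Pr_{G,\bar{s}}^{\sigma,\pi}(\{s_0 a_0 s_1 a_1 \ldots \in IPath_{\bar{s}} \mid s_i \in g \text{ for some } i\}).$$

We say that $\phi$ is satisfied under $\sigma$ and $\pi$, denoted $G, \sigma, \pi \models \phi$, if $Pr_{G,\bar{s}}^{\sigma,\pi}(F\, g) \bowtie p$. 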

We also augment stochastic games with reward structures, which are functions of the 
form $r : S \times A \to \mathbb{Q}_{\geq 0}$ mapping state-action pairs to non-negative rationals. In practice, we 
often use these to represent “costs” (e.g. elapsed time or energy consumption), despite the 
terminology “rewards”. The restriction to non-negative rewards allows us to avoid problems 
with non-uniqueness of total rewards, which would require special treatment 113 - 

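Rewards are accumulated along a path and, for strategies $\sigma \in \Sigma_G^\Diamond$ and $\pi \in \Sigma_G^\Box$, the 
expected total reward is defined as: 

$$E_{G,\bar{s}}^{\sigma,\pi}(r) = \int_{\omega = s_0 a_0 s_1 a_1 \ldots \in IPath_{\bar{s}}} \sum_{j=0}^{\infty} r(s_j, a_j) \, dPr_{G,\bar{s}}^{\sigma,\pi}.$$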

For technical reasons, we will always assume the maximum possible reward $\sup_{\sigma,\pi} E_{G,\bar{s}}^{\sigma,\pi}(r)$ 
is finite (which can be checked with an analysis of the game’s underlying graph); similar 
assumptions are commonly introduced [25, Section 7]. In our proofs, we will also use 
$E_{G,\bar{s}}^{\sigma,\pi}(r{\downarrow}s)$ for the expected total reward accumulated before the first visit to $s$, defined by: 

$$E_{G,\bar{s}}^{\sigma,\pi}(r{\downarrow}s) = \int_{\omega = s_0 a_0 s_1 a_1 \ldots \in IPath_{\bar{s}}} \sum_{i=0}^{fst(s,\omega)-1} r(s_i, a_i) \, dPr_{G,\bar{s}}^{\sigma,\pi}$$



Figure 1: A stochastic game G used as a running example (see Ex. 2.2). 


where $fst(s, \omega)$ is $\min\{i \mid s_i = s\}$ if $\omega = s_0 a_0 s_1 a_1 \ldots$ contains $s$, and $\infty$ otherwise. 

An expected reward property is written $\phi = R^r_{\bowtie b}[C]$ (where C stands for cumulative), 
meaning that the expected total reward for $r$ satisfies $\bowtie b$. We say that $\phi$ is satisfied under 
strategies $\sigma$ and $\pi$, denoted $G, \sigma, \pi \models \phi$, if $E_{G,\bar{s}}^{\sigma,\pi}(r) \bowtie b$. 

In fact, probabilistic reachability can easily be reduced to expected total reward (by 
replacing any outgoing transitions from states in the target set with a single transition to a 
sink state labelled with a reward of 1). Thus, in the techniques presented in this paper, we 
focus purely on expected total reward. 
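
The reduction just described can be sketched in a few lines of Python. The function name `reachability_to_reward`, the dictionary layout for the transition function and the fresh names `sink` and `done` are assumptions made for illustration only.

```python
def reachability_to_reward(delta, targets, sink="sink", action="done"):
    """Turn 'reach a state in targets' into an expected total reward problem.

    delta maps (state, action) to {successor: probability}. Outgoing
    transitions of target states are replaced by one transition to an
    absorbing sink, and that single transition carries reward 1.
    """
    new_delta, reward = {}, {}
    for (s, a), dist in delta.items():
        if s not in targets:
            new_delta[(s, a)] = dict(dist)
            reward[(s, a)] = 0
    for s in targets:
        new_delta[(s, action)] = {sink: 1}
        reward[(s, action)] = 1
    # The sink loops forever and collects no further reward.
    new_delta[(sink, action)] = {sink: 1}
    reward[(sink, action)] = 0
    return new_delta, reward


# Hypothetical example: from s0 the target g is reached with probability 0.3.
delta = {
    ("s0", "go"): {"g": 0.3, "fail": 0.7},
    ("g", "loop"): {"g": 1},
    ("fail", "loop"): {"fail": 1},
}
new_delta, reward = reachability_to_reward(delta, {"g"})
```

Since reward 1 is collected at most once, and exactly on the paths that visit a target state, the expected total reward in the transformed game coincides with the reachability probability in the original one.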


2.3. Controller Synthesis. To perform controller synthesis, we model the system as a 
stochastic game $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$, where player $\Diamond$ represents the controller and player 
$\Box$ represents the environment. A specification of the required behaviour of the system is a 
property $\phi$, either a probabilistic reachability property $P_{\bowtie p}[F\, g]$ or an expected total reward 
property $R^r_{\bowtie b}[C]$. 

Definition 2.1 (Sound strategy). A strategy $\sigma \in \Sigma_G^\Diamond$ for player $\Diamond$ in stochastic game $G$ is 
sound for a property $\phi$ if $G, \sigma, \pi \models \phi$ for any strategy $\pi \in \Sigma_G^\Box$. 

To simplify notation, we will consistently use $\sigma$ and $\pi$ to refer to strategies of player $\Diamond$ 
and $\Box$, respectively, and will not always state explicitly that $\sigma \in \Sigma_G^\Diamond$ and $\pi \in \Sigma_G^\Box$. Notice 
that, in Defn. 2.1, strategies $\sigma$ and $\pi$ are both memoryless. We could equivalently allow $\pi$ to 
range over history-dependent strategies since, for the properties $\phi$ considered in this paper 
(probabilistic reachability and expected total reward), the existence of a history-dependent 
counter-strategy $\pi$ for which $G, \sigma, \pi \not\models \phi$ implies the existence of a memoryless one. 

The classical controller synthesis problem asks whether there exists a sound strategy 
for game $G$ and property $\phi$. We can determine whether this is the case by computing the 
optimal strategy for player $\Diamond$ in $G$. This problem is known to be in NP $\cap$ co-NP, 
but, in practice, methods such as value or policy iteration can be used efficiently. 
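
To make the reference to value iteration concrete, the following is a minimal sketch for an upper-bound property $R^r_{\leq b}[C]$, where the controller minimises and the environment maximises the expected total reward. The function name `value_iteration`, the dictionary layout and the three-state game are hypothetical. The simple stopping test on successive iterates is an assumption that is adequate for this toy example but does not guarantee a given precision in general.

```python
def value_iteration(controller, delta, reward, eps=1e-10, max_iter=100_000):
    """Approximate the min-max expected total reward from every state.

    controller: set of controller states (minimising); every other state
                belongs to the environment (maximising).
    delta:      dict (state, action) -> {successor: probability}
    reward:     dict (state, action) -> non-negative number
    """
    states = {s for (s, _) in delta}
    x = dict.fromkeys(states, 0.0)  # start from 0 and iterate upwards
    for _ in range(max_iter):
        new = {}
        for s in states:
            options = [
                reward[(u, a)] + sum(p * x[t] for t, p in dist.items())
                for (u, a), dist in delta.items() if u == s
            ]
            new[s] = min(options) if s in controller else max(options)
        if max(abs(new[s] - x[s]) for s in states) < eps:
            return new
        x = new
    return x


# Hypothetical game: the controller owns s0, the environment owns e.
delta = {
    ("s0", "safe"): {"goal": 1.0},
    ("s0", "risky"): {"e": 1.0},
    ("e", "help"): {"goal": 1.0},
    ("e", "block"): {"s0": 0.25, "goal": 0.75},
    ("goal", "loop"): {"goal": 1.0},
}
reward = {("s0", "safe"): 2.0, ("s0", "risky"): 1.0,
          ("e", "help"): 0.0, ("e", "block"): 0.0, ("goal", "loop"): 0.0}
values = value_iteration({"s0"}, delta, reward)
```

In this toy game the value in `s0` is 4/3 (the controller plays `risky`, the environment answers with `block`), so a sound strategy exists for any bound $b \geq 4/3$.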

Example 2.2. Fig. 1 shows a stochastic game $G$, with controller and environment player 
states drawn as diamonds and squares, respectively. It models the control of a robot moving 
between 4 locations ($s_0$, $s_2$, $s_3$, $s_5$). When moving east ($s_0 \to s_2$ or $s_3 \to s_5$), it may be impeded 
by a second robot, depending on the position of the latter. If it is impeded, there is a chance 
that it does not successfully move to the next location. 

We use a reward structure $moves$, which assigns 1 to the controller actions north, east, 
south, and define property $\phi = R^{moves}_{\leq 5}[C]$, meaning that the expected number of moves to 
reach $s_5$ is at most 5 (notice that $s_5$ is the only state from which all subsequent transitions 
have reward zero). A sound strategy for $\phi$ in $G$ (found by minimising $moves$) chooses south 
in $s_0$ and east in $s_3$, yielding an expected number of moves of 3.5. 


3. Permissive Controller Synthesis 

We now define a framework for permissive controller synthesis, which generalises classical 
controller synthesis by producing multi-strategies that offer the controller flexibility about 
which actions to take in each state. 

3.1. Multi-Strategies. Multi-strategies generalise the notion of strategies, as defined in 
Section 2. They will always be defined for player $\Diamond$ of a game. 

Definition 3.1 (Multi-strategy). Let $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$ be a stochastic game. A (memoryless) 
multi-strategy for $G$ is a function $\theta : S_\Diamond \to Dist(2^A)$ with $\theta(s)(\emptyset) = 0$ for all $s \in S_\Diamond$. 

As for strategies, a multi-strategy $\theta$ is deterministic if $\theta$ always returns a Dirac distribution, 
and randomised otherwise. We write $\Theta_G^{det}$ and $\Theta_G^{rand}$ for the sets of all deterministic 
and randomised multi-strategies in $G$, respectively. 

A deterministic multi-strategy $\theta$ chooses a set of allowed actions in each state $s \in S_\Diamond$, 
i.e., those in the unique set $B \subseteq A$ for which $\theta(s)(B) = 1$. When $\theta$ is deterministic, we 
will often abuse notation and write $a \in \theta(s)$ for the actions $a \in B$. The remaining actions 
$A(s) \setminus B$ are said to be disallowed in $s$. In contrast to classical controller synthesis, where a 
strategy $\sigma$ can be seen as providing instructions about precisely which action to take in each 
state, in permissive controller synthesis a multi-strategy provides (allows) multiple actions, 
any of which can be taken. A randomised multi-strategy generalises this by selecting a set 
of allowed actions in state $s$ randomly, according to distribution $\theta(s)$. 

We say that a controller strategy $\sigma$ complies with multi-strategy $\theta$, denoted $\sigma \triangleleft \theta$, if 
it picks actions that are allowed by $\theta$. Formally (taking into account the possibility of 
randomisation), we define this as follows. 

Definition 3.2 (Compliant strategy). Let $\theta$ be a multi-strategy and $\sigma$ a strategy for a game 
$G$. We say that $\sigma$ is compliant (or that it complies) with $\theta$, written $\sigma \triangleleft \theta$, if, for any state 
$s \in S_\Diamond$ and non-empty subset $B \subseteq A(s)$, there is a distribution $d_{s,B} \in Dist(B)$ such that, 
for all $a \in A(s)$, $\sigma(s)(a) = \sum_{B \ni a} \theta(s)(B) \cdot d_{s,B}(a)$. 

Example 3.3. Let us explain the technical definition of a compliant strategy on the 
game from Ex. 2.2 (see Fig. 1). Consider a randomised multi-strategy $\theta$ that, in $s_0$, picks 
{east, south} with probability 0.5, {south} with probability 0.3, and {east} with probability 
0.2. A compliant strategy then needs to, for some number $0 \leq x \leq 1$, pick south with 
probability $0.3 + 0.5 \cdot x$ and east with probability $0.2 + 0.5 \cdot (1 - x)$. The number $x$ corresponds 
to the probability $d_{s_0,\{east,south\}}(south)$ in the formal definition above. 

Hence, a strategy $\sigma$ that picks east and south with equal probability 0.5 satisfies the 
requirements of compliance in state $s_0$, as witnessed by selecting $x = 0.4$, or, in other words, 
the distribution $d_{s_0,\{east,south\}}$ assigning 0.4 and 0.6 to south and east, respectively. On the 
other hand, a strategy that picks east with probability 0.8 cannot be compliant with $\theta$, since 
east can be chosen with probability at most $0.2 + 0.5 = 0.7$. 
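
Compliance in a single state can also be checked mechanically. The sketch below does not construct the distributions $d_{s,B}$ explicitly; instead it uses a standard feasibility condition for transportation problems (a consequence of max-flow/min-cut), which is our substitution rather than a method from the paper: suitable $d_{s,B}$ exist iff, for every set $C$ of actions, the probability that $\sigma$ assigns to $C$ is at least the total probability of the allowed sets $B \subseteq C$. The function name `complies` and the data layout are hypothetical, and both inputs are assumed to be probability distributions.

```python
from fractions import Fraction
from itertools import combinations


def complies(sigma_s, theta_s):
    """Check compliance of a strategy with a multi-strategy in one state.

    sigma_s: dict action -> probability (the strategy in this state)
    theta_s: dict frozenset of actions -> probability (the multi-strategy)
    """
    actions = sorted(set(sigma_s) | {a for B in theta_s for a in B})
    for k in range(len(actions) + 1):
        for combo in combinations(actions, k):
            C = frozenset(combo)
            # Probability mass that the multi-strategy forces into C.
            forced = sum((p for B, p in theta_s.items() if B <= C), Fraction(0))
            chosen = sum((sigma_s.get(a, Fraction(0)) for a in C), Fraction(0))
            if forced > chosen:
                return False
    return True


# The multi-strategy of Ex. 3.3 in state s0.
theta_s0 = {
    frozenset({"east", "south"}): Fraction(1, 2),
    frozenset({"south"}): Fraction(3, 10),
    frozenset({"east"}): Fraction(1, 5),
}
uniform = {"east": Fraction(1, 2), "south": Fraction(1, 2)}
mostly_east = {"east": Fraction(4, 5), "south": Fraction(1, 5)}
```

Here `complies(uniform, theta_s0)` holds, while `complies(mostly_east, theta_s0)` fails for $C = \{south\}$: the multi-strategy forces probability 0.3 onto south, but the strategy only assigns 0.2. The enumeration is exponential in the number of actions, which is acceptable only for small action sets.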

Each multi-strategy determines a set of compliant strategies, and our aim is to design 
multi-strategies which allow as many actions as possible, but at the same time ensure that 
any compliant strategy satisfies some specified property. We define the notion of a sound 
multi-strategy, i.e., one that is guaranteed to satisfy a property $\phi$ when complied with. 

Definition 3.4 (Sound multi-strategy). A multi-strategy $\theta$ for game $G$ is sound for a 
property $\phi$ if any strategy $\sigma$ that complies with $\theta$ is sound for $\phi$. 

Example 3.5. We return again to the stochastic game from Ex. 2.2 (see Fig. 1) and re-use 
the property $\phi = R^{moves}_{\leq 5}[C]$. A strategy that picks south in $s_0$ and east in $s_3$ results in an 
expected reward of 3.5 (i.e., 3.5 moves on average to reach $s_5$). A strategy that picks east in 
$s_0$ and south in $s_2$ yields expected reward 5. Thus, a (deterministic) multi-strategy $\theta$ that 
picks {south, east} in $s_0$, {south} in $s_2$ and {east} in $s_3$ is sound for $\phi$ since the expected 
reward is always at most 5. 


3.2. Penalties and Permissivity. The motivation behind synthesising multi-strategies is 
to offer flexibility in the actions to be taken, while still satisfying a particular property $\phi$. 
Generally, we want a multi-strategy $\theta$ to be as permissive as possible, i.e., to impose as few 
restrictions as possible on actions to be taken. We formalise the notion of permissivity by 
assigning penalties to actions in the model, which we then use to quantify the extent to 
which actions are disallowed by $\theta$. Penalties provide expressivity in the way that we quantify 
permissivity: if it is preferable that certain actions are allowed rather than others, then these 
can be assigned higher penalty values. 

A penalty scheme is a pair $(\psi, t)$, comprising a penalty function $\psi : S_\Diamond \times A \to \mathbb{Q}_{\geq 0}$ and a 
penalty type $t \in \{sta, dyn\}$. The function $\psi$ represents the impact of disallowing each action 
in each controller state of the game. The type $t$ dictates how penalties for individual actions 
are combined to quantify the permissivity of a specific multi-strategy. For static penalties 
($t = sta$), we simply sum penalties across all states of the model. For dynamic penalties 
($t = dyn$), we take into account the likelihood that disallowed actions would actually have 
been available, by using the expected sum of penalty values. 

More precisely, for a penalty scheme $(\psi, t)$ and a multi-strategy $\theta$, we define the resulting 
penalty for $\theta$, denoted $pen_t(\psi, \theta)$, as follows. First, we define the local penalty for $\theta$ at state 
$s \in S_\Diamond$ as $pen_{loc}(\psi, \theta, s) = \sum_{B \subseteq A(s)} \sum_{a \notin B} \theta(s)(B)\, \psi(s, a)$. If $\theta$ is deterministic, $pen_{loc}(\psi, \theta, s)$ 
is simply the sum of the penalties of actions that are disallowed by $\theta$ in $s$. If $\theta$ is randomised, 
$pen_{loc}(\psi, \theta, s)$ gives the expected penalty value in $s$, i.e., the sum of penalties weighted by 
the probability with which $\theta$ disallows them in $s$. 

Now, for the static case, we sum the local penalties over all states, i.e., we put: 

$$pen_{sta}(\psi, \theta) = \sum_{s \in S_\Diamond} pen_{loc}(\psi, \theta, s).$$

For the dynamic case, we use the (worst-case) expected sum of local penalties. We define an 
auxiliary reward structure $\psi^\theta_{rew}$ given by the local penalties: $\psi^\theta_{rew}(s, a) = pen_{loc}(\psi, \theta, s)$ for 
all $s \in S_\Diamond$ and $a \in A(s)$, and $\psi^\theta_{rew}(s, a) = 0$ for all $s \in S_\Box$ and $a \in A(s)$. Then: 

$$pen_{dyn}(\psi, \theta, s) = \sup\{E_{G,s}^{\sigma,\pi}(\psi^\theta_{rew}) \mid \sigma \in \Sigma_G^\Diamond, \pi \in \Sigma_G^\Box \text{ and } \sigma \text{ complies with } \theta\}.$$

We use $pen_{dyn}(\psi, \theta) = pen_{dyn}(\psi, \theta, \bar{s})$ to reference the dynamic penalty in the initial state. 

3.3. Permissive Controller Synthesis. We can now formally define the central problem 
studied in this paper. 

Definition 3.6 (Permissive controller synthesis). Consider a game $G$, a class of multi-strategies 
$\star \in \{det, rand\}$, a property $\phi$, a penalty scheme $(\psi, t)$ and a threshold $c \in \mathbb{Q}_{\geq 0}$. 
The permissive controller synthesis problem asks: does there exist a multi-strategy $\theta \in \Theta_G^\star$ 
that is sound for $\phi$ and satisfies $pen_t(\psi, \theta) \leq c$? 


Alternatively, in a more quantitative fashion, we can aim to synthesise (if it exists) an 
optimally permissive sound multi-strategy. 


Definition 3.7 (Optimally permissive). Let $G$, $\star$, $\phi$ and $(\psi, t)$ be as in Defn. 3.6. A sound 
multi-strategy $\theta \in \Theta_G^\star$ is optimally permissive if its penalty $pen_t(\psi, \theta)$ equals the infimum 
$\inf\{pen_t(\psi, \theta') \mid \theta' \in \Theta_G^\star \text{ and } \theta' \text{ is sound for } \phi\}$. 


Example 3.8. We return to Ex. 3.5 and consider a static penalty scheme $(\psi, sta)$ assigning 
1 to the actions north, east, south (in any state). The deterministic multi-strategy $\theta$ from 
Ex. 3.5 is optimally permissive for $\phi = R^{moves}_{\leq 5}[C]$, with penalty 1 (just north in $s_3$ is 
disallowed). If we instead use $\phi'$ = R™°g es [C], the multi-strategy $\theta'$ that extends $\theta$ by also 
allowing north is now sound and optimally permissive, with penalty 0. Alternatively, the 
randomised multi-strategy $\theta''$ that picks {north} with probability 0.7 and {north, east} with 
probability 0.3 in $s_3$ is sound for $\phi$ with penalty just 0.7. 
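
The static penalties quoted in this example can be recomputed directly from the definition of $pen_{loc}$. In the sketch below, the function names and the data layout are hypothetical. The sets of enabled actions are not read off Fig. 1 but inferred from the penalties stated in the example, and we assume that the randomised multi-strategy behaves like the deterministic one outside $s_3$; both are assumptions.

```python
from fractions import Fraction


def local_penalty(theta_s, enabled_s, psi_s):
    """pen_loc: expected penalty of the enabled actions disallowed in one state."""
    return sum(
        (p * psi_s[a] for B, p in theta_s.items() for a in enabled_s - B),
        Fraction(0),
    )


def static_penalty(theta, enabled, psi):
    """pen_sta: sum of the local penalties over all controller states."""
    return sum(
        (local_penalty(theta[s], enabled[s], psi[s]) for s in theta),
        Fraction(0),
    )


# Enabled actions per controller state (assumed, see the lead-in above).
enabled = {"s0": {"east", "south"}, "s2": {"south"}, "s3": {"north", "east"}}
# Penalty 1 for every action in every state.
psi = {s: {a: Fraction(1) for a in acts} for s, acts in enabled.items()}

# Deterministic multi-strategy of Ex. 3.5.
theta_det = {
    "s0": {frozenset({"south", "east"}): Fraction(1)},
    "s2": {frozenset({"south"}): Fraction(1)},
    "s3": {frozenset({"east"}): Fraction(1)},
}
# Randomised variant: only the choice in s3 changes.
theta_rand = dict(theta_det)
theta_rand["s3"] = {
    frozenset({"north"}): Fraction(7, 10),
    frozenset({"north", "east"}): Fraction(3, 10),
}
```

Under these assumptions, `static_penalty(theta_det, enabled, psi)` evaluates to 1 and `static_penalty(theta_rand, enabled, psi)` to 7/10, in line with the values stated in the example.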


It is important to point out that penalties will typically be used for relative comparisons 
of multi-strategies. If two multi-strategies $\theta$ and $\theta'$ incur penalties $x$ and $x'$ with $x < x'$, 
then the interpretation is that $\theta$ is better than $\theta'$; there is not necessarily any intuitive 
meaning assigned to the values x and x' themselves. Accordingly, when modelling a system, 
the penalties of actions should be chosen to reflect the actions’ relative importance. This is 
different from rewards, which usually correspond to a specific measure of the system. 

Next, we establish several fundamental results about the permissive controller synthesis 
problem. Proofs that are particularly technical are postponed to the appendix and we only 
highlight the key ideas in the main body of the paper. 

Optimality. Recall that two key parameters of the problem are the type of multi-strategy 
sought (deterministic or randomised) and the type of penalty scheme used (static or dynamic). 
We first note that randomised multi-strategies are strictly more powerful than deterministic 
ones, i.e., they can be more permissive (yield a lower penalty) whilst satisfying the same 
property $\phi$. 

Theorem 3.9. The answer to a permissive controller synthesis problem (for either a static 
or dynamic penalty scheme) can be “no” for deterministic multi-strategies, but “yes” for 
randomised ones. 


Proof. Consider an MDP with states $s$, $t_1$ and $t_2$, and actions $a_1$ and $a_2$, where $\delta(s, a_i)(t_i) = 1$ 
for $i \in \{1, 2\}$, and $t_1$, $t_2$ have self-loops only. Let $r$ be a reward structure assigning 1 to 
$(s, a_1)$ and 0 to all other state-action pairs, and $\psi$ be a penalty function assigning 1 to $(s, a_2)$ 
and 0 elsewhere. We then ask whether there is a multi-strategy satisfying $\phi = R^r_{\geq 0.5}[C]$ with 
penalty at most 0.5. 

Considering either static or dynamic penalties, the randomised multi-strategy $\theta$ that 
chooses distribution $0.5{:}\{a_1\} + 0.5{:}\{a_2\}$ in $s$ is sound and yields penalty 0.5. However, there 
is no such deterministic multi-strategy. □ 
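
The counterexample in this proof is small enough to verify by enumeration. The sketch below hard-codes the single decision in state $s$; all identifiers are hypothetical. It relies on the observation that, for a lower-bound property, the worst compliant strategy picks the least rewarding action in each allowed set, and that $s$ is visited exactly once, so static and dynamic penalties coincide.

```python
from fractions import Fraction
from itertools import combinations

# Rewards and penalties of the two actions available in state s.
reward = {"a1": Fraction(1), "a2": Fraction(0)}
penalty = {"a1": Fraction(0), "a2": Fraction(1)}
actions = sorted(reward)


def worst_case_reward(theta_s):
    """Least expected reward over all strategies complying with theta_s."""
    return sum(p * min(reward[a] for a in B) for B, p in theta_s.items())


def incurred_penalty(theta_s):
    """Expected penalty of the actions disallowed by theta_s."""
    return sum(
        p * sum(penalty[a] for a in actions if a not in B)
        for B, p in theta_s.items()
    )


def answers_yes(theta_s, bound=Fraction(1, 2), threshold=Fraction(1, 2)):
    """Sound for reward >= bound and penalty at most threshold?"""
    return worst_case_reward(theta_s) >= bound and incurred_penalty(theta_s) <= threshold


# All three deterministic multi-strategies: {a1}, {a2} and {a1, a2}.
deterministic = [
    {frozenset(B): Fraction(1)}
    for k in (1, 2)
    for B in combinations(actions, k)
]
# The randomised multi-strategy used in the proof.
randomised = {frozenset({"a1"}): Fraction(1, 2), frozenset({"a2"}): Fraction(1, 2)}
```

Evaluating `answers_yes` returns `False` for each of the three deterministic candidates and `True` for `randomised`: allowing only $a_1$ costs penalty 1, while any set containing $a_2$ admits a compliant strategy with reward 0.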

This is why we explicitly distinguish between classes of multi-strategies when defining 
permissive controller synthesis. This situation contrasts with classical controller synthesis, 
where deterministic strategies are optimal for the same classes of properties $\phi$. Intuitively, 
randomisation is more powerful in this case because of the trade-off between rewards 
and penalties: similar results exist in, for example, multi-objective controller synthesis on 
MDPs [H]. 

Next, we observe that, for the case of static penalties, the optimal penalty value for a 
given property (the infimum of achievable values) may not actually be achievable by any 
randomised multi-strategy. 

Theorem 3.10. For permissive controller synthesis using a static penalty scheme, an 
optimally permissive randomised multi-strategy does not always exist. 


Proof. Consider a game with states $s$ and $t$, and actions $a$ and $b$, where we define 
$\delta(s, a)(s) = 1$ and $\delta(s, b)(t) = 1$, and $t$ has just a self-loop. The reward structure $r$ assigns 1 
to $(s, b)$ and 0 to all other state-action pairs. The penalty function $\psi$ assigns 1 to $(s, a)$ and 
0 elsewhere. 

Now observe that any multi-strategy which disallows the action $a$ with probability $\varepsilon > 0$ 
and allows all other actions incurs penalty $\varepsilon$ and is sound for $R^r_{\geq 1}[C]$, since any strategy 
which complies with the multi-strategy leads to action $b$ being taken eventually. Thus, the 
infimum of achievable penalties is 0. However, the multi-strategy that incurs penalty 0, i.e. 
allows all actions, is not sound for $R^r_{\geq 1}[C]$. □ 

If, on the other hand, we restrict our attention to deterministic multi-strategies, then an 
optimally permissive multi-strategy does always exist (since the set of deterministic, memoryless 
multi-strategies is finite). For randomised multi-strategies with dynamic penalties, 
the question remains open. 

Complexity. Next, we present complexity results for the different variants of the permissive 
controller synthesis problem. We begin with lower bounds. 

Theorem 3.11. The permissive controller synthesis problem is NP-hard, for either static 
or dynamic penalties, and deterministic or randomised multi-strategies. 


We prove NP-hardness by reduction from the Knapsack problem, where weights of 
items are represented by penalties, and their values are expressed in terms of rewards to be 
achieved. The most delicate part is the proof for randomised strategies, where we need to 
ensure that the multi-strategy cannot benefit from picking certain actions (corresponding to 
items being put into the Knapsack) with probability other than 0 or 1. See Appx. A.1 for 
details. For upper bounds, we have the following. 


Theorem 3.12. The permissive controller synthesis problem for deterministic (resp. randomised) 
strategies is in NP (resp. PSPACE) for dynamic/static penalties. 


For deterministic multi-strategies, it is straightforward to show NP membership in both 
the dynamic and static penalty case, since we can guess a multi-strategy satisfying the 
required conditions and check its correctness in polynomial time. For randomised multi-strategies, 
with some technical effort, we can encode existence of the required multi-strategy 
as a formula of the existential fragment of the theory of real arithmetic, solvable with 
polynomial space [7]. See Appx. A.2. A natural question is whether the PSPACE upper 
bound for randomised multi-strategies can be improved. We show that this is likely to be 
difficult, by giving a reduction from the square-root-sum problem. 


Theorem 3.13. There is a reduction from the square-root-sum problem to the permissive 
controller synthesis problem with randomised multi-strategies, for both static and dynamic 
penalties. 

We use a variant of the problem that asks, given positive rationals $x_1, \ldots, x_n$ and $y$, 
whether $\sum_{i=1}^{n} \sqrt{x_i} \leq y$. This problem is known to be in PSPACE, but establishing a better 
complexity bound is a long-standing open problem in computational geometry [17]. See 
Appx. A.3 for details. 


4. MILP-Based Synthesis of Multi-Strategies 

We now consider practical methods for synthesising multi-strategies that are sound for a 
property $\phi$ and optimally permissive for some penalty scheme. Our methods use mixed 
integer linear programming (MILP), which optimises an objective function subject to linear 
constraints that mix both real and integer variables. A variety of efficient, off-the-shelf MILP 
solvers exists. 

An important feature of the MILP solvers we use is that they work incrementally, 
producing a sequence of increasingly good solutions. Here, that means generating a series of 
sound multi-strategies that are increasingly permissive. In practice, when computational 
resources are constrained, it may be acceptable to stop early and accept a multi-strategy 
that is sound but not necessarily optimally permissive. 

Here, and in the rest of this section, we assume that the property $\phi$ is of the form $R^r_{\geq b}[C]$. 
Upper bounds on expected rewards ($\phi = R^r_{\leq b}[C]$) can be handled by negating rewards and 
converting to a lower bound. For the purposes of encoding into MILP, we rescale $r$ and 
$b$ such that $\sup_{\sigma,\pi} E_{G,s}^{\sigma,\pi}(r) < 1$ for all $s$, and rescale every (non-zero) penalty such that 
$\psi(s, a) \geq 1$ for all $s$ and $a \in A(s)$. 

We begin by discussing the synthesis of deterministic multi-strategies, first for static 
penalties and then for dynamic penalties. Subsequently, we present an approach to synthesising 
approximations to optimal randomised multi-strategies. In each case, we describe 
encodings into MILP problems and prove their correctness. We conclude this section with a 
brief discussion of ways to optimise the MILP encodings. Then, in Section 5, we investigate 
the practical applicability of our techniques. 


4.1. Deterministic Multi-Strategies with Static Penalties. Fig. 2 shows an encoding
into MILP of the problem of finding an optimally permissive deterministic multi-strategy
for property $\phi = R^{r}_{\geq b}[C]$ and a static penalty scheme $(\psi, \mathit{sta})$. The encoding uses 5 types of
variables: $y_{s,a} \in \{0,1\}$, $x_s \in [0,1]$, $\alpha_s \in \{0,1\}$, $\beta_{s,a,t} \in \{0,1\}$ and $\gamma_t \in [0,1]$, where $s,t \in S$
and $a \in A$. The worst-case size of the MILP problem is $O(|A| \cdot |S|^2 \cdot \kappa)$, where $\kappa$ stands for
the longest encoding of a number used.

Variables $y_{s,a}$ encode a multi-strategy $\theta$ as follows: $y_{s,a}$ has value 1 iff $\theta$ allows action $a$ in
$s \in S_\Diamond$ (constraint (4.2) enforces at least one allowed action per state). Variables $x_s$ represent
the worst-case expected total reward (for $r$) from state $s$, under any controller strategy
complying with $\theta$ and under any environment strategy. This is captured by constraints
(4.3) and (4.4) (which are analogous to the linear constraints used when minimising the reward
in an MDP). Constraint (4.1) puts the required bound of $b$ on the reward from $\bar{s}$.

The objective function minimises the static penalty (the sum of all local penalties) minus 
the expected reward in the initial state. The latter acts as a tie-breaker between solutions 
with equal penalties (but, thanks to rescaling, is always dominated by the penalties and 
therefore does not affect optimality). 

Minimise: $-x_{\bar{s}} + \sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$ subject to:

$x_{\bar{s}} \geq b$    (4.1)
$1 \leq \sum_{a \in A(s)} y_{s,a}$    for all $s \in S_\Diamond$    (4.2)
$x_s \leq \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a) + (1 - y_{s,a})$    for all $s \in S_\Diamond$, $a \in A(s)$    (4.3)
$x_s \leq \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a)$    for all $s \in S_\Box$, $a \in A(s)$    (4.4)
$x_s \leq \alpha_s$    for all $s \in S$    (4.5)
$y_{s,a} = (1 - \alpha_s) + \sum_{t \in \mathit{supp}(\delta(s,a))} \beta_{s,a,t}$    for all $s \in S$, $a \in A(s)$    (4.6)
$y_{s,a} = 1$    for all $s \in S_\Box$, $a \in A(s)$    (4.7)
$\gamma_t < \gamma_s + (1 - \beta_{s,a,t}) + r(s,a)$    for all $s, a, t$ with $t \in \mathit{supp}(\delta(s,a))$    (4.8)

Figure 2: MILP encoding for deterministic multi-strategies with static penalties.


Minimise: $z_{\bar{s}}$ subject to (4.1), ..., (4.8) and:

$\ell_s = \sum_{a \in A(s)} \psi(s,a) \cdot (1 - y_{s,a})$    for all $s \in S_\Diamond$    (4.9)
$z_s \geq \sum_{t \in S} \delta(s,a)(t) \cdot z_t + \ell_s - c \cdot (1 - y_{s,a})$    for all $s \in S_\Diamond$, $a \in A(s)$    (4.10)
$z_s \geq \sum_{t \in S} \delta(s,a)(t) \cdot z_t$    for all $s \in S_\Box$, $a \in A(s)$    (4.11)

Figure 3: MILP encoding for deterministic multi-strategies with dynamic penalties.


As an additional technicality, we need to ensure the values of $x_s$ are the least solution
of the defining inequalities, to deal with the possibility of zero reward loops. To achieve this,
we use an approach similar to the one taken in [?]. It is sufficient to ensure that $x_s = 0$
whenever the minimum expected reward from $s$ achievable under $\theta$ is 0, which is true if and
only if, starting from $s$, it is possible to avoid ever taking an action with positive reward.

In our encoding, $\alpha_s = 1$ if $x_s$ is positive (constraint (4.5)). The binary variables
$\beta_{s,a,t} = 1$ represent, for each such $s$ and each action $a$ allowed in $s$, a choice of successor
$t = t(s,a) \in \mathit{supp}(\delta(s,a))$ (constraint (4.6)). The variables $\gamma_s$ then represent a ranking
function: if $r(s,a) = 0$, then $\gamma_s > \gamma_{t(s,a)}$ (constraint (4.8)). If a positive reward could be
avoided starting from $s$, there would in particular be an infinite sequence $s_0, a_0, s_1, \ldots$ with
$s_0 = s$ and, for all $i$, either (i) $x_{s_i} > x_{s_{i+1}}$, or (ii) $x_{s_i} = x_{s_{i+1}}$, $s_{i+1} = t(s_i, a_i)$ and $r(s_i, a_i) = 0$,
and therefore $\gamma_{s_i} > \gamma_{s_{i+1}}$. This means that the sequence $(x_{s_0}, \gamma_{s_0}), (x_{s_1}, \gamma_{s_1}), \ldots$ is (strictly)
decreasing w.r.t. the lexicographical order, but at the same time $S$ is finite, and so this
sequence would have to enter a loop, which is a contradiction.
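As a point of reference for what the encoding in Fig. 2 computes, the following minimal sketch enumerates all deterministic multi-strategies of a tiny game by brute force and keeps a sound one with the smallest static penalty. It is not the MILP encoding itself and does not scale; the toy game, the state and action names, and the helper functions are all hypothetical, and the game is acyclic so that iterating the Bellman operator upwards from zero reaches the least solution exactly.

```python
from itertools import combinations, product

# Hypothetical toy game (not from the paper). Each state maps an action name to
# (reward, {successor: probability}). "end" has no actions and yields reward 0.
GAME = {
    "s0": {"safe": (0.2, {"end": 1.0}), "risky": (0.0, {"e": 1.0})},
    "e": {"ok": (0.5, {"end": 1.0}), "fail": (0.0, {"end": 1.0})},
    "end": {},
}
CONTROLLER = {"s0"}  # every other state belongs to the environment
PENALTY = {("s0", "safe"): 1.0, ("s0", "risky"): 1.0}


def worst_case_reward(theta, start, sweeps=100):
    """Iterate the minimising Bellman operator upwards from 0 (least fixpoint)."""
    x = {s: 0.0 for s in GAME}
    for _ in range(sweeps):
        for s, actions in GAME.items():
            allowed = theta[s] if s in CONTROLLER else set(actions)
            if allowed:
                x[s] = min(
                    actions[a][0] + sum(p * x[t] for t, p in actions[a][1].items())
                    for a in allowed
                )
    return x[start]


def non_empty_subsets(items):
    items = sorted(items)
    return [set(c) for k in range(1, len(items) + 1) for c in combinations(items, k)]


def optimal_multi_strategy(start, bound):
    """Return (penalty, theta) with minimal static penalty among sound theta, or None."""
    ctrl = sorted(CONTROLLER)
    best = None
    for choice in product(*(non_empty_subsets(GAME[s]) for s in ctrl)):
        theta = dict(zip(ctrl, choice))
        if worst_case_reward(theta, start) < bound:
            continue  # theta is not sound for the reward bound
        penalty = sum(PENALTY[s, a] for s in ctrl for a in GAME[s] if a not in theta[s])
        if best is None or penalty < best[0]:
            best = (penalty, theta)
    return best
```

For the bound 0.1 the only sound choice in this toy game blocks the action "risky", since the environment can otherwise force reward 0. The MILP replaces the exponential enumeration by the binary variables $y_{s,a}$ and the worst-case values by the variables $x_s$.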


Correctness. Before proving the correctness of the encoding (stated in Theorem 4.2
below), we prove the following auxiliary lemma that characterises the reward achieved under
a multi-strategy in terms of a solution of a set of inequalities.

Lemma 4.1. Let $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$ be a stochastic game, $\phi = R^{r}_{\geq b}[C]$ a property, $(\psi, \mathit{sta})$
a static penalty scheme and $\theta$ a deterministic multi-strategy. Consider the inequalities:

$x_s \leq \min_{a \in \theta(s)} \sum_{s' \in S} \delta(s,a)(s') \cdot x_{s'} + r(s,a)$    for $s \in S_\Diamond$
$x_s \leq \min_{a \in A(s)} \sum_{s' \in S} \delta(s,a)(s') \cdot x_{s'} + r(s,a)$    for $s \in S_\Box$.

Then the following hold:

• $x_s = \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r)$ is a solution to the above inequalities.

• A solution $x_s$ to the above inequalities satisfies $x_s \leq \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r)$ for all $s$ whenever
the following condition holds: for every $s$ with $x_s > 0$, every $\sigma \triangleleft \theta$ and every $\pi$ there is a
path $\omega = s_0 a_0 \ldots s_n a_n$ starting in $s$ that satisfies $Pr^{\sigma,\pi}_{G,s}(\omega) > 0$ and $r(s_n, a_n) > 0$.

Proof. The game $G$, together with $\theta$, determines a Markov decision process $G^\theta = \langle \emptyset, S_\Diamond \cup S_\Box, \bar{s}, A, \delta' \rangle$
in which the choices disallowed by $\theta$ are removed, i.e. $\delta'(s,a)$ is equal to $\delta(s,a)$
for every $s \in S_\Box$ and every $s \in S_\Diamond$ with $a \in \theta(s)$, and is undefined for any other combination
of $s$ and $a$. We have:

$\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r) = \inf_{\bar{\sigma}} E^{\bar{\sigma}}_{G^\theta,s}(r)$

since, for any strategy pair $\sigma \triangleleft \theta$ and $\pi$ in $G$, there is a strategy $\bar{\sigma}$ in $G^\theta$ which is defined,
for every finite path $\omega$ of $G^\theta$ ending in $t$, by $\bar{\sigma}(\omega) = \sigma(\omega)$ or $\bar{\sigma}(\omega) = \pi(\omega)$, depending on
whether $t \in S_\Diamond$ or $t \in S_\Box$, and which satisfies $E^{\sigma,\pi}_{G,s}(r) = E^{\bar{\sigma}}_{G^\theta,s}(r)$. Similarly, a strategy $\bar{\sigma}$
for $G^\theta$ induces a compliant strategy $\sigma$ and a strategy $\pi$ defined for every finite path $\omega$ of $G$
ending in $S_\Diamond$ (resp. $S_\Box$) by $\sigma(\omega) = \bar{\sigma}(\omega)$ (resp. $\pi(\omega) = \bar{\sigma}(\omega)$).

The rest is then the following simple application of results from the theory of Markov
decision processes. The first item of the lemma follows from [25, Theorem 7.1.3], which
gives a characterisation of values in MDPs in terms of Bellman equations; the inequalities in
the lemma are in fact a relaxation of these equations. For the second part of the lemma,
observe that if $\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r)$ is infinite, then the claim holds trivially. Otherwise, from
the assumption on the existence of $\omega$ we have that, under any compliant strategy, there is a
path $\omega' = s_0 a_0 s_1 \ldots s_n$ of length at most $|S|$ in $G^\theta$ such that $\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_n}(r) = 0$ (otherwise
the reward would be infinite) and so $x_{s_n} = 0$. We can thus apply [25, Proposition 7.3.4],
which states that a solution to our inequalities gives optimal values whenever under any
strategy the probability of reaching a state $s$ with $x_s = 0$ is 1. Note that the result of [25]
applies for maximisation of reward in “negative models”; our problem can be easily reduced
to this setting by multiplying the rewards by $-1$ and looking for maximising (instead of
minimising) strategies. □

Theorem 4.2. Let $G$ be a game, $\phi = R^{r}_{\geq b}[C]$ a property and $(\psi, \mathit{sta})$ a static penalty
scheme. There is a sound multi-strategy in $G$ for $\phi$ with penalty $p$ if and only if there is an
optimal assignment to the MILP instance from Fig. 2 which satisfies
$p = \sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$.


Proof. We prove that every multi-strategy $\theta$ induces a satisfying assignment to the variables
such that the static penalty under $\theta$ is $\sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$, and vice versa. The
theorem then follows from the rescaling of rewards and penalties that we performed.

We start by proving that, given a sound multi-strategy $\theta$, we can construct a satisfying
assignment $\{y_{s,a}, x_s, \alpha_s, \beta_{s,a,t}, \gamma_t\}_{s,t \in S, a \in A}$ to the constraints from Fig. 2. For $s \in S_\Diamond$ and
$a \in A(s)$ we set $y_{s,a} = 1$ if $a \in \theta(s)$, and otherwise we set $y_{s,a} = 0$. This gives satisfaction of
constraint (4.2). For $s \in S_\Box$ and $a \in A(s)$ we set $y_{s,a} = 1$, ensuring satisfaction of (4.7). We
then put $x_s = \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r)$. By the first part of Lemma 4.1 we get that constraints (4.1),
(4.3) (for $a \in \theta(s)$) and (4.4) are satisfied. Constraint (4.3) for $a \notin \theta(s)$ is satisfied because
in this case $y_{s,a} = 0$, and so the right-hand side is always at least 1.

We further set $\alpha_s = 1$ if $x_s > 0$ and $\alpha_s = 0$ if $x_s = 0$, thus satisfying constraint (4.5).
For a state $s$, let $d_s$ be the maximum distance to a positive reward. Formally, the values $d_s$
are defined inductively by putting $d_s = 0$ for any state $s$ such that we have $r(s,a) > 0$ for
all $a \in A(s)$, and then for any other state $s$:

$d_s = 1 + \min_{a \in \theta(s),\, r(s,a) = 0} \; \max_{\delta(s,a)(t) > 0} d_t$    if $s \in S_\Diamond$
$d_s = 1 + \min_{a \in A(s),\, r(s,a) = 0} \; \max_{\delta(s,a)(t) > 0} d_t$    if $s \in S_\Box$

Put $d_s = \bot$ if $d_s$ was not defined by the above. For $s$ such that $d_s \neq \bot$, we put $\gamma_s = d_s / |S|$,
and for every $a$ we choose $t$ such that $d_t < d_s$, and set $\beta_{s,a,t} = 1$, leaving $\beta_{s,a,t} = 0$ for all
other $t$. For $s$ such that $d_s = \bot$ we define $\gamma_s = 0$ and for all $a$ and $t$ put $\beta_{s,a,t} = 0$. This
ensures the satisfaction of the remaining constraints.

In the opposite direction, assume that we are given a satisfying assignment. Firstly, we
create a game $G'$ from $G$ by making any states $s$ with $x_s = 0$ sink states (i.e. imposing a
self-loop with no penalty on $s$ and removing all other transitions). Any sound multi-strategy
$\theta$ for $\phi$ in $G'$ directly gives a sound multi-strategy $\theta'$ for $\phi$ in $G$ defined by $\theta'(s) = \theta(s)$ for
states $s \in S_\Diamond$ with $x_s > 0$, and otherwise letting $\theta'$ allow all available actions.

We construct $\theta$ for $G'$ by putting $\theta(s) = \{a \in A(s) \mid y_{s,a} = 1\}$ for all $s \in S_\Diamond$ with
$x_s > 0$, and by allowing the self-loop in the states $s \in S_\Diamond$ with $x_s = 0$; note that $\theta(s)$ is
non-empty by constraint (4.2). First, by definition, the multi-strategy yields the penalty
$\sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$. Next, we will show that $\theta$ satisfies the assumption of the
second part of Lemma 4.1, from which we get that:

$\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G',s}(r) \geq x_s$

which, together with constraint (4.1) being satisfied, gives us the desired result.

Consider any $s$ such that $\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G',s}(r) > 0$. Then we have $x_s > 0$ (by the definition
of $G'$). Let us fix any $\sigma \triangleleft \theta$ and any $\pi$, and let $s_0 = s$. We show that there is a path $\omega$
satisfying the assumption of the lemma. We build $\omega = s_0 \ldots s_n a_n$ inductively, to satisfy: (i)
$r(s_n, a_n) > 0$, (ii) $x_{s_i} \geq x_{s_{i-1}}$ for all $i$, and (iii) for any sub-path $s_i a_i \ldots s_j$ with $x_{s_i} = x_{s_j}$
we have that $\gamma_{s_k} < \gamma_{s_{k-1}}$ for all $i + 1 \leq k \leq j$.

Assume we have defined a prefix $s_0 a_0 \ldots s_i$ to satisfy conditions (ii) and (iii). We put $a_i$
to be the action picked by $\sigma$ (or $\pi$) in $s_i$. If $r(s_i, a_i) > 0$, we are done. Otherwise, we pick
$s_{i+1}$ as follows:

• If there is $s' \in \mathit{supp}(\delta(s_i, a_i))$ with $x_{s'} > x_{s_i}$, then we put $s_{i+1} = s'$. Such a choice again
satisfies (ii) and (iii) by definition.

• If we have $x_{s'} = x_{s_i}$ for all $s' \in \mathit{supp}(\delta(s_i, a_i))$, then any choice will satisfy (ii). To satisfy
the other conditions, we pick $s_{i+1}$ so that $\beta_{s_i,a_i,s_{i+1}} = 1$ is true. We argue that such an $s_{i+1}$
can be chosen. We have $x_{s_i} > 0$ and so $\alpha_{s_i} = 1$ by constraint (4.5). We also have $y_{s_i,a_i} = 1$:
for $s_i \in S_\Diamond$ this follows from the definition of $\theta$, for $s_i \in S_\Box$ from constraint (4.7). Hence,
since constraint (4.6) is satisfied, there must be $s_{i+1}$ such that $\beta_{s_i,a_i,s_{i+1}} = 1$. Then, we
apply constraint (4.8) (for $s = s_i$, $t = s_{i+1}$ and $a = a_i$) and, since the last two summands
on the right-hand side are 0, we get $\gamma_{s_{i+1}} < \gamma_{s_i}$, thus satisfying (iii).

Note that the above construction must terminate after at most $|S|$ steps since, due to
conditions (ii) and (iii), no state repeats on $\omega$. Because the only way of terminating is
satisfaction of (i), we are done. □


4.2. Deterministic Multi-Strategies with Dynamic Penalties. Next, we show how
to compute a sound and optimally permissive deterministic multi-strategy for a dynamic
penalty scheme $(\psi, \mathit{dyn})$. This case is more subtle since the optimal penalty can be infinite.
Hence, our solution proceeds in two steps as follows.

Initially, we determine if there is some sound multi-strategy. For this, we just need
to check for the existence of a sound strategy, using standard algorithms for solution of
stochastic games [?]. If there is no sound multi-strategy, we are done. Otherwise, we
use the MILP problem in Fig. 3 to determine the penalty for an optimally permissive sound
multi-strategy. This MILP encoding extends the one in Fig. 2 for static penalties, adding
variables $\ell_s$ and $z_s$, representing the local and the expected penalty in state $s$, and three
extra sets of constraints. First, (4.9) and (4.10) define the expected penalty in controller
states, which is the sum of penalties for all disabled actions and those in the successor states,
multiplied by their transition probabilities. The behaviour of environment states is then
captured by constraint (4.11), where we only maximise the penalty, without incurring any
penalty locally.


The constant $c$ in (4.10) is chosen to be no lower than any finite penalty achievable by a
deterministic multi-strategy, a possible value being:

$\sum_{i=0}^{\infty} (1 - p^{|S|})^{i} \cdot p^{|S|} \cdot i \cdot |S| \cdot \mathit{pen}_{\max}$    (4.12)


where $p$ is the smallest non-zero probability assigned by $\delta$, and $\mathit{pen}_{\max}$ is the maximal local
penalty over all states. To see that (4.12) indeed gives a safe bound on $c$ (i.e. it is no lower than
any finite penalty achievable), observe that for the penalty to be finite under a deterministic
multi-strategy, for every state $s$ there must be a path of length at most $|S|$ to a state from
which no penalty will be incurred. This path has probability at least $p^{|S|}$, and since the
penalty accumulated along a path of length $i \cdot |S|$ is at most $i \cdot |S| \cdot \mathit{pen}_{\max}$, the properties of
(4.12) follow easily.


If the MILP problem has a solution, this is the optimal dynamic penalty over all sound
multi-strategies. If not, no deterministic sound multi-strategy has a finite penalty and
the optimal penalty is $\infty$ (recall that we already established there is some sound
multi-strategy). In practice, we might choose a lower value of $c$ than the one above, resulting in a
multi-strategy that is sound, but possibly not optimally permissive.

Correctness. Formally, correctness of the MILP encoding for the case of dynamic penalties 
is captured by the following theorem. 


Theorem 4.3. Let $G$ be a game, $\phi = R^{r}_{\geq b}[C]$ a property and $(\psi, \mathit{dyn})$ a dynamic penalty
scheme. Assume there is a sound multi-strategy for $\phi$. The MILP formulation from Fig. 3
satisfies: (a) there is no solution if and only if the optimally permissive deterministic
multi-strategy yields infinite penalty; and (b) there is a solution $z_{\bar{s}}$ if and only if an optimally
permissive deterministic multi-strategy yields penalty $z_{\bar{s}}$.

Proof. We show that any sound multi-strategy with finite penalty $z_{\bar{s}}$ gives rise to a satisfying
assignment with the objective value $z_{\bar{s}}$, and vice versa. Then, (b) follows directly, and (a)
follows by the assumption that there is some sound multi-strategy.

Let us prove that for any sound multi-strategy $\theta$ we can construct a satisfying assignment
to the constraints. For constraints (4.1) to (4.8), the construction works exactly the same
as in the proof of Theorem 4.2. For the newly added variables, i.e. $z_s$ and $\ell_s$, we put
$\ell_s = \mathit{pen}_{\mathit{loc}}(\psi, \theta, s)$, ensuring satisfaction of constraint (4.9), and:

$z_s = \sup_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(\psi^{\theta}_{\mathit{rew}})$

which, together with [25, Section 7.2.7, Equation 7.2.17] (giving characterisation of optimal
reward in terms of a linear program), ensures that constraints (4.10) and (4.11) are satisfied.

In the opposite direction, given a satisfying assignment we construct $\theta$ for $G'$ exactly
as in the proof of Theorem 4.2. As before, we can argue that constraints (4.1) to (4.8) are
satisfied under any sound multi-strategy. We now need to argue that the multi-strategy
satisfies $\mathit{pen}_{\mathit{dyn}}(\psi, \theta, s) \leq z_s$. It is easy to see that $\mathit{pen}_{\mathit{loc}}(\psi, \theta, s) = \ell_s$. Moreover, by
[25, Section 7.2.7, Equation 7.2.17] the penalty is the least solution to the inequalities:

$z'_s \geq \max_{a \in \theta(s)} \sum_{s' \in S} \delta(s,a)(s') \cdot z'_{s'} + \ell_s$    for all $s \in S_\Diamond$    (4.13)
$z'_s \geq \max_{a \in A(s)} \sum_{s' \in S} \delta(s,a)(s') \cdot z'_{s'}$    for all $s \in S_\Box$    (4.14)

We can replace (4.13) with:

$z'_s \geq \max_{a \in A(s)} \sum_{s' \in S} \delta(s,a)(s') \cdot z'_{s'} + \ell_s - c \cdot (1 - y_{s,a})$    (4.15)

since for $a \in \theta(s)$ we have $c \cdot (1 - y_{s,a}) = 0$ and otherwise $c \cdot (1 - y_{s,a})$ is greater than
$\sum_{s' \in S} \delta(s,a)(s') \cdot z'_{s'} + \ell_s$ in the least solution to (4.13) and (4.14), by the definition of $c$.
Finally, it suffices to observe that the set of solutions to (4.14) and (4.15) is the same as the
set of solutions to (4.10) and (4.11). □


4.3. Approximating Randomised Multi-Strategies. In Section 3, we showed that
randomised multi-strategies can outperform deterministic ones. The MILP encodings in
Figs. 2 and 3, though, cannot be adapted to the randomised case, since this would need
non-linear constraints (intuitively, we would need to multiply expected total rewards by
probabilities of actions being allowed under a multi-strategy, and both these quantities
are unknowns in our formalisation). Instead, in this section, we propose an approximation
which finds the optimal randomised multi-strategy $\theta$ in which each probability $\theta(s)(B)$ is a
multiple of $\frac{1}{M}$ for a given granularity $M$. Any such multi-strategy can then be simulated
by a deterministic one on a transformed game, allowing synthesis to be carried out using
the MILP-based methods described in the previous section. Before giving the definition
of the transformed game, we show that we can simplify our problem by restricting to
multi-strategies which in any state select at most two actions with non-zero probability.

Theorem 4.4. Let $G$ be a game, $\phi = R^{r}_{\geq b}[C]$ a property, and $(\psi, \tau)$ a (static or dynamic)
penalty scheme. For any sound multi-strategy $\theta$ we can construct another sound multi-strategy
$\theta'$ such that $\mathit{pen}_{\tau}(\psi, \theta) \geq \mathit{pen}_{\tau}(\psi, \theta')$ and $|\mathit{supp}(\theta'(s))| \leq 2$ for any $s \in S_\Diamond$.

Proof. If the (dynamic) penalty under $\theta$ is infinite, then the solution is straightforward:
we can simply take $\theta'$ which, in every state, allows a single action so that the reward is
maximised. This restrictive multi-strategy enforces a strategy that maximises the reward
(so it performs at least as well as any other multi-strategy), and at the same time it cannot
yield a dynamic penalty worse than that of $\theta$, as the dynamic penalty under $\theta$ is already infinite.
From now on, we will assume that the penalty is finite.

Let $\theta$ be a multi-strategy allowing $n > 2$ different sets $A_1, \ldots, A_n$ with non-zero
probabilities $\lambda_1, \ldots, \lambda_n$ in $s_1 \in S_\Diamond$. We construct a multi-strategy $\theta'$ that in $s_1$ allows only
two of the sets $A_1, \ldots, A_n$ with non-zero probability, and in other states behaves like $\theta$.

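We first prove the case of dynamic penalties and then describe the differences for static
penalties. Suppose that $\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(r) \leq \inf_{\sigma \triangleleft \theta',\pi} E^{\sigma,\pi}_{G,s_1}(r)$. Then, for any state $s$, the expected reward is:

$\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s}(r)$
$\quad = \inf_{\sigma \triangleleft \theta,\pi} \big( E^{\sigma,\pi}_{G,s}(r{\downarrow}s_1) + Pr^{\sigma,\pi}_{G,s}(F\, s_1) \cdot E^{\sigma,\pi}_{G,s_1}(r) \big)$
$\quad = \inf_{\sigma \triangleleft \theta,\pi} \big( E^{\sigma,\pi}_{G,s}(r{\downarrow}s_1) + Pr^{\sigma,\pi}_{G,s}(F\, s_1) \cdot \inf_{\sigma' \triangleleft \theta,\pi'} E^{\sigma',\pi'}_{G,s_1}(r) \big)$
$\quad \leq \inf_{\sigma \triangleleft \theta,\pi} \big( E^{\sigma,\pi}_{G,s}(r{\downarrow}s_1) + Pr^{\sigma,\pi}_{G,s}(F\, s_1) \cdot \inf_{\sigma' \triangleleft \theta',\pi'} E^{\sigma',\pi'}_{G,s_1}(r) \big)$
$\quad = \inf_{\sigma \triangleleft \theta',\pi} \big( E^{\sigma,\pi}_{G,s}(r{\downarrow}s_1) + Pr^{\sigma,\pi}_{G,s}(F\, s_1) \cdot \inf_{\sigma' \triangleleft \theta',\pi'} E^{\sigma',\pi'}_{G,s_1}(r) \big)$    (*)
$\quad = \inf_{\sigma \triangleleft \theta',\pi} E^{\sigma,\pi}_{G,s}(r)$

where the equation (*) above follows by the fact that, up to the first time $s_1$ is reached,
$\theta$ and $\theta'$ allow the same actions. Hence, it suffices to define $\theta'$ so that
$\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(r) \leq \inf_{\sigma \triangleleft \theta',\pi} E^{\sigma,\pi}_{G,s_1}(r)$.
Similarly, for the penalties, it is enough to ensure
$\sup_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(\psi^{\theta}_{\mathit{rew}}) \geq \sup_{\sigma \triangleleft \theta',\pi} E^{\sigma,\pi}_{G,s_1}(\psi^{\theta'}_{\mathit{rew}})$.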

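Let $P_i$ and $R_i$, where $i \in \{1, \ldots, n\}$, be the penalties and rewards from $\theta$ after allowing
$A_i$ against an optimal opponent strategy, i.e.:

$P_i = \sum_{a \notin A_i} \psi(s_1, a) + \sup_{\sigma \triangleleft \theta,\pi} \max_{a \in A_i} \sum_{s' \in S} \delta(s_1,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(\psi^{\theta}_{\mathit{rew}})$
$R_i = \inf_{\sigma \triangleleft \theta,\pi} \min_{a \in A_i} \big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r) \big)$

We also define $R = \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(r)$ and $P = \sup_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(\psi^{\theta}_{\mathit{rew}})$ and have $R = \sum_{i=1}^{n} \lambda_i R_i$
and $P = \sum_{i=1}^{n} \lambda_i P_i$.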

Let $S_0 \subseteq S$ be those states for which there are $\sigma \triangleleft \theta$ and $\pi$ ensuring a return to $s_1$ without
accumulating any reward. Formally, $S_0$ contains all states $s_0$ which satisfy $Pr^{\sigma,\pi}_{G,s_0}(F \{s_1\}) = 1$
and $E^{\sigma,\pi}_{G,s_0}(r{\downarrow}s_1) = 0$ for some $\sigma \triangleleft \theta$ and $\pi$. We say that $A_i$ is progressing if for all $a \in A_i$ we have
$r(s_1, a) > 0$ or $\mathit{supp}(\delta(s_1, a)) \not\subseteq S_0$. We note that $A_i$ is progressing whenever $R_i > R$ (since
any $a$ violating the condition above could have been used by the opponent to force $R_i \leq R$).

For each tuple $\mu = (\mu_1, \ldots, \mu_n) \in \mathbb{R}^n$, let $R^{\mu} = \mu_1 R_1 + \cdots + \mu_n R_n$ and $P^{\mu} = \mu_1 P_1 + \cdots + \mu_n P_n$.
Then the set $T = \{(R^{\mu}, P^{\mu}) \mid 0 \leq \mu_i \leq 1, \mu_1 + \cdots + \mu_n = 1\}$ is a bounded convex
polygon, with vertices given by images $(R^{e_i}, P^{e_i})$ of unit vectors (i.e., Dirac distributions)
$e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$, and containing $(R^{\lambda}, P^{\lambda}) = (R, P)$. To each vertex $(\hat{R}_j, \hat{P}_j)$ we
associate the (non-empty) set $I_j = \{i \mid (R^{e_i}, P^{e_i}) = (\hat{R}_j, \hat{P}_j)\}$ of indices.

We will find $\alpha \in (0,1)$ and $1 \leq u, v \leq n$ such that one of $A_u$ or $A_v$ is progressing, and
define the multi-strategy $\theta'$ to pick $A_u$ and $A_v$ with probabilities $\alpha$ and $1 - \alpha$, respectively.
We distinguish several cases, depending on the shape of $T$:

(1) $T$ has non-empty interior. Let $(\hat{R}_1, \hat{P}_1), \ldots, (\hat{R}_m, \hat{P}_m)$ be its vertices in the anticlockwise
order. Since all $\lambda_i$ are positive, $(R, P)$ is in the interior of $T$. Now consider the point
$(R, P')$ directly below $(R, P)$ on the boundary of $T$, i.e. $P' = \min\{P'' \mid (R, P'') \in T\}$.
If $(R, P')$ is not a vertex, it is a convex combination of adjacent vertices
$(R, P') = \alpha(\hat{R}_j, \hat{P}_j) + (1 - \alpha)(\hat{R}_{j+1}, \hat{P}_{j+1})$, and we pick such $\alpha$ and $u \in I_j$ and $v \in I_{j+1}$. If $(R, P')$
happens to be a vertex $(\hat{R}_j, \hat{P}_j)$ we can (since $\hat{P}_j < P$) instead choose $\alpha < 1$ sufficiently
close to 1 so that $R \leq \alpha \hat{R}_j + (1 - \alpha)\hat{R}_{j+1}$ and $P \geq \alpha \hat{P}_j + (1 - \alpha)\hat{P}_{j+1}$ and again pick
$u \in I_j$ and $v \in I_{j+1}$. In either case, we necessarily have $\hat{R}_{j+1} > R$ (by ordering of the
vertices in the anticlockwise order and since $(R, P)$ is in the interior of $T$), and so $A_v$ is
progressing.

(2) $T$ is a vertical line segment, i.e. it is the convex hull of two extreme points $(R, \hat{P}_0)$ and
$(R, \hat{P}_1)$ with $\hat{P}_0 < \hat{P}_1$. In case $R = 0$, we can simply always allow some $A_i$ with $i \in I_0$,
minimising the penalty and still achieving reward 0.

If $R > 0$, there must be at least one progressing $A_u$. Since all $\lambda_i$ are positive, $(R, P)$
lies inside the line segment, and in particular $P > \hat{P}_0$. We can therefore choose some $v$
and $\alpha \in (0,1)$ such that $P \geq \alpha \cdot P_u + (1 - \alpha) \cdot P_v$.

(3) $T$ is a non-vertical line segment, i.e. it is the convex hull of two extreme points $(\hat{R}_0, \hat{P}_0)$
and $(\hat{R}_1, \hat{P}_1)$ with $\hat{R}_0 < \hat{R}_1$. Since all $\lambda_i$ are positive, $(R, P)$ is not one of the extreme
points, i.e. $(R, P) = \alpha(\hat{R}_0, \hat{P}_0) + (1 - \alpha)(\hat{R}_1, \hat{P}_1)$ with $0 < \alpha < 1$. We can therefore
choose $u \in I_0$, $v \in I_1$. Again, since $\hat{R}_1 > R$, $A_v$ is progressing.

(4) $T$ consists of a single point $(R, P)$. This can be treated like the second case: either
$R = 0$, and we can allow any combination, or $R > 0$, and there is some progressing $A_u$,
and we then pick arbitrary $v$ and $\alpha$.
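The geometric core of the case analysis can be sketched in code as a search for two sets and a mixing weight. The sketch below covers only the requirement that the mixture is no worse than $(R, P)$, i.e. $\alpha R_u + (1-\alpha) R_v \geq R$ and $\alpha P_u + (1-\alpha) P_v \leq P$; it deliberately ignores the progressing condition and does not follow the boundary-walking argument above. The function name and the example values are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def two_set_mixture(points, weights):
    """points[i] = (R_i, P_i); weights[i] = lambda_i > 0, summing to 1.

    Returns (u, v, alpha) with 0 < alpha < 1 such that mixing sets u and v with
    probabilities alpha and 1 - alpha gives reward >= R and penalty <= P,
    or None if no ordered pair admits such an alpha.
    """
    points = [(Fraction(r), Fraction(p)) for r, p in points]
    weights = [Fraction(w) for w in weights]
    R = sum(w * r for w, (r, _) in zip(weights, points))
    P = sum(w * p for w, (_, p) in zip(weights, points))
    for u, v in permutations(range(len(points)), 2):
        (ru, pu), (rv, pv) = points[u], points[v]
        lo, hi = Fraction(0), Fraction(1)
        feasible = True
        # Both requirements are rewritten in the form coeff * alpha >= rhs.
        for coeff, rhs in ((ru - rv, R - rv), (pv - pu, pv - P)):
            if coeff > 0:
                lo = max(lo, rhs / coeff)
            elif coeff < 0:
                hi = min(hi, rhs / coeff)
            elif rhs > 0:
                feasible = False
        alpha = (lo + hi) / 2  # midpoint of the admissible interval
        if feasible and lo <= hi and 0 < alpha < 1:
            return u, v, alpha
    return None
```

For example, with three sets at $(R_i, P_i) = (1/5, 1), (3/5, 2), (2/5, 5)$ and uniform weights, the search settles on the first two sets, which improves both reward and penalty relative to the three-way mixture.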

We now want to show that the reward of the updated multi-strategy is indeed no worse than
before. For $i \in \{u, v\}$ we define:

$R'_i = \inf_{\sigma \triangleleft \theta',\pi} \min_{a \in A_i} \big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r) \big)$

and we define $R' = \alpha R'_u + (1 - \alpha) R'_v$. Pick an action $a$ (resp. $a'$) that realises the minimum
and strategies $\sigma$ and $\pi$ (resp. $\sigma'$ and $\pi'$) that realise the infimum in the definition of $R_i$
(resp. $R'_i$). (Such strategies indeed exist.) Define:

$c_i = \sum_{s'} \delta(s_1,a)(s') \cdot Pr^{\sigma,\pi}_{G,s'}(F\, s_1)$        $c'_i = \sum_{s'} \delta(s_1,a')(s') \cdot Pr^{\sigma',\pi'}_{G,s'}(F\, s_1)$
$d_i = r(s_1,a) + \sum_{s'} \delta(s_1,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r{\downarrow}s_1)$        $d'_i = r(s_1,a') + \sum_{s'} \delta(s_1,a')(s') \cdot E^{\sigma',\pi'}_{G,s'}(r{\downarrow}s_1)$

We have $R_i = c_i \cdot R + d_i$ for every $1 \leq i \leq n$, and $R'_i = c'_i \cdot R' + d'_i$ for all $i \in \{u, v\}$. Then:

$R' - R = (\alpha R'_u + (1 - \alpha) R'_v) - \sum_{i} \lambda_i R_i$
$\quad = (\alpha R'_u + (1 - \alpha) R'_v) - (\alpha R_u + (1 - \alpha) R_v) + (\alpha R_u + (1 - \alpha) R_v) - \sum_{i} \lambda_i R_i$
$\quad \geq (\alpha R'_u + (1 - \alpha) R'_v) - (\alpha R_u + (1 - \alpha) R_v)$    (by the choice of $\alpha$, $u$, $v$)
$\quad = (\alpha(c'_u R' + d'_u) + (1 - \alpha)(c'_v R' + d'_v)) - (\alpha(c_u R + d_u) + (1 - \alpha)(c_v R + d_v))$
$\quad \geq (\alpha(c'_u R' + d'_u) + (1 - \alpha)(c'_v R' + d'_v)) - (\alpha(c'_u R + d'_u) + (1 - \alpha)(c'_v R + d'_v))$
        ($c_i R + d_i \leq c'_i R + d'_i$ by optimality of actions/strategies defining $c_i$ and $d_i$)
$\quad = (\alpha c'_u + (1 - \alpha) c'_v)(R' - R)$,

i.e., $(1 - \alpha c'_u - (1 - \alpha) c'_v)(R' - R) \geq 0$. By finiteness of rewards and the choice of $\theta'(s_1)$,
at least one of the return probabilities $c'_u$, $c'_v$ is less than 1, and thus so is $\alpha c'_u + (1 - \alpha) c'_v$,
therefore $R' \geq R$.

We can show that the penalty under $\theta'$ is at most as big as the penalty under $\theta$ in
exactly the same way (note that, in addition, $\psi^{\theta'}_{\mathit{rew}}$ is used instead of $\psi^{\theta}_{\mathit{rew}}$ for $c'_i$, $d'_i$, $R'$ and $R'_i$).
For static penalties, the proof that the new multi-strategy is no worse than the old one is
straightforward from the choice of $\theta'(s_1)$. □

Figure 4: A node in the original game $G$ (left), and the corresponding nodes in the
transformed game $G'$ for approximating randomised multi-strategies (right, see Section 4.3).

The result just proved allows us to simplify the game construction that we use to map
between (discretised) randomised multi-strategies and deterministic ones. Let the original
game be $G$ and the transformed game be $G'$. The transformation is illustrated in Fig. 4.
The left-hand side shows a controller state $s \in S_\Diamond$ in the original game $G$ (i.e., the one for
which we are seeking randomised multi-strategies). For each such state, we add the two
layers of states illustrated on the right-hand side of the figure: gadgets $s'_1, s'_2$ representing the
two subsets $B \subseteq A(s)$ with $\theta(s)(B) > 0$, and selectors $s_i$ (for $1 \leq i \leq m$), which distribute
probability among the two gadgets. Two new actions, $b_1$ and $b_2$, are also added to label the
transitions between selectors $s_i$ and gadgets $s'_1, s'_2$.

The selectors $s_i$ are reached from $s$ via a transition using fixed probabilities $p_1, \ldots, p_m$
which need to be chosen appropriately. For efficiency, we want to minimise the number of
selectors $m$ for each state $s$. We let $m = \lfloor 1 + \log_2 M \rfloor$ and $p_i = \frac{l_i}{M}$, where $l_1, \ldots, l_m \in \mathbb{N}$
are defined recursively as follows: $l_1 = \lceil \frac{M}{2} \rceil$ and $l_i = \lceil \frac{M - (l_1 + \cdots + l_{i-1})}{2} \rceil$ for $2 \leq i \leq m$. For
example, for $M = 10$, we have $m = 4$ and $l_1, \ldots, l_4 = 5, 3, 1, 1$, so $p_1, \ldots, p_4 = \frac{5}{10}, \frac{3}{10}, \frac{1}{10}, \frac{1}{10}$.
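The recursive definition of the selector weights is easy to check mechanically. The following small sketch (helper names are hypothetical) computes $l_1, \ldots, l_m$ and the probabilities $p_i = l_i / M$, and also enumerates the sums obtainable from subsets of the weights, which is the property that lets $m$ selectors realise every probability $k/M$.

```python
from fractions import Fraction


def selector_weights(M):
    """Return [l_1, ..., l_m] with l_1 = ceil(M/2), l_i = ceil((M - l_1 - ... - l_{i-1}) / 2)."""
    m = M.bit_length()  # equals floor(1 + log2(M)) for every integer M >= 1
    weights, remaining = [], M
    for _ in range(m):
        l = (remaining + 1) // 2  # integer ceiling of remaining / 2
        weights.append(l)
        remaining -= l
    return weights


def selector_probabilities(M):
    """Return the exact probabilities p_i = l_i / M."""
    return [Fraction(l, M) for l in selector_weights(M)]


def reachable_sums(weights):
    """All values obtainable as the sum of some subset of the weights."""
    sums = {0}
    for w in weights:
        sums |= {s + w for s in sums}
    return sums
```

For $M = 10$ this yields the weights 5, 3, 1, 1 from the example, and `reachable_sums` returns every integer from 0 to 10, so each multiple of 1/10 can be routed to one gadget by a suitable subset of selectors.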
We are now able to find optimal discretised randomised multi-strategies in $G$ by finding
optimal deterministic multi-strategies in $G'$. This connection will be formalised in Lemma 4.5
below. But we first point out that, for the case of static penalties, a small transformation
to the MILP encoding (see Fig. 2) used to solve game $G'$ is required. The problem is that
the definition of static penalties on $G'$ does not precisely capture the static penalties of the
original game $G$. In this case, we adapt Fig. 2 as follows. For each state $s$, action $a \in A(s)$
and $i \in \{1, \ldots, m\}$, we add a binary variable $y'_{s_i,a}$ and constraints $y'_{s_i,a} \geq y_{s'_j,a} - (1 - y_{s_i,b_j})$
for $j \in \{1, 2\}$. We then change the objective function that we minimise to:

$-x_{\bar{s}} + \sum_{s \in S_\Diamond} \sum_{a \in A(s)} \psi(s,a) \cdot \sum_{i=1}^{m} p_i \cdot (1 - y'_{s_i,a})$

Lemma 4.5. Let $G$ be a game, $\phi = R^{r}_{\geq b}[C]$ a property, $(\psi, \tau)$ a (static or dynamic) penalty
scheme, and let $G'$ be the game transformed as described above. The following two claims
are equivalent:

(1) There is a sound multi-strategy $\theta$ in $G$ with $\mathit{pen}_{\mathit{dyn}}(\psi, \theta) = x$ (or, for static penalties,
$\mathit{pen}_{\mathit{sta}}(\psi, \theta) = x$), and $\theta$ only uses probabilities that are multiples of $\frac{1}{M}$.

(2) There is a sound deterministic multi-strategy $\theta'$ in $G'$ and $\mathit{pen}_{\mathit{dyn}}(\psi, \theta') = x$ (or, for
static penalties, $\sum_{s \in S_\Diamond} \sum_{i=1}^{m} p_i \cdot \psi'(i) = x$ where $\psi'(i)$ is $\sum_{a \in A(s) \setminus (\theta'(s'_1) \cup \theta'(s'_2))} \psi(s,a)$ if
$\theta'(s_i) = \{b_1, b_2\}$, and otherwise $\psi'(i)$ is $\sum_{a \in A(s) \setminus \theta'(s'_j)} \psi(s,a)$ where $j$ is the (unique)
number with $b_j \in \theta'(s_i)$).


Proof. Firstly, observe that for any integer $0 \leq k \leq M$ there is a set $I_k \subseteq \{1, \ldots, m\}$ with
$\sum_{j \in I_k} l_j = k$. The opposite direction also holds.

Let $\theta$ be a multi-strategy in $G$. By Theorem 4.4 we can assume that $|\mathit{supp}(\theta(s))| \leq 2$ for
any $s$. We create $\theta'$ as follows. For every state $s \in S_\Diamond$ with $\{A_1, A_2\} = \mathit{supp}(\theta(s))$, we set
$\theta'(s'_1)(A_1) = 1$ and $\theta'(s'_2)(A_2) = 1$. Then, supposing $\theta(s)(A_1) = \frac{k}{M}$, we let $\theta'(s_i)(\{b_1\}) = 1$
whenever $i \in I_k$, and $\theta'(s_i)(\{b_2\}) = 1$ whenever $i \notin I_k$. If $\mathit{supp}(\theta(s))$ is a singleton set, the
construction is analogous. Clearly, the property for static penalties is preserved. For
any memoryless $\sigma' \triangleleft \theta'$ there is a memoryless strategy $\sigma \triangleleft \theta$ that is given by
$\sigma(s)(a) = \frac{k}{M} \cdot \sigma'(s'_1)(a) + (1 - \frac{k}{M}) \cdot \sigma'(s'_2)(a)$ for any $a$, and conversely for any $\sigma \triangleleft \theta$ we can define $\sigma' \triangleleft \theta'$
by putting $\sigma'(s'_1) = d_{s,A_1}$ and $\sigma'(s'_2) = d_{s,A_2}$ for all $s$, where $d_{s,A_1}$ and $d_{s,A_2}$ are distributions
witnessing that $\sigma$ is compliant with $\theta$. It is easy to see that both $\sigma$ and $\sigma'$ in either of the
above yield the same reward and dynamic penalty.

In the other direction, we define $\theta$ from $\theta'$ for all $s \in S_\Diamond$ as follows. Let $A_1$ and $A_2$ be
the sets allowed by $\theta'$ in $s'_1$ and $s'_2$ respectively. If $A_1 = A_2$, then $\theta(s)$ allows this set with
probability 1. Otherwise $\theta(s)$ allows the set $A_1 \cup A_2$ with probability $\sum_{i : \theta'(s_i) = \{b_1, b_2\}} p_i$, the
set $A_1$ with probability $\sum_{i : \theta'(s_i) = \{b_1\}} p_i$ and the set $A_2$ with probability $\sum_{i : \theta'(s_i) = \{b_2\}} p_i$.
The correctness can be proved similarly to above. □
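The second direction of the proof, which reads a randomised choice in $G$ off a deterministic multi-strategy in $G'$, can be sketched as follows. The representation is an assumption made for illustration only: selector choices are given as sets of the labels "b1" and "b2", and the two gadget sets $A_1$, $A_2$ as Python sets of action names.

```python
from fractions import Fraction


def randomised_choice(selector_choices, probabilities, a1, a2):
    """Map a deterministic choice in the transformed game to a distribution over action sets.

    selector_choices[i] is the set of labels from {"b1", "b2"} allowed in selector s_i,
    probabilities[i] is p_i, and a1, a2 are the sets allowed in the gadgets s'_1 and s'_2.
    """
    a1, a2 = frozenset(a1), frozenset(a2)
    if a1 == a2:
        return {a1: Fraction(1)}
    target = {
        frozenset({"b1"}): a1,
        frozenset({"b2"}): a2,
        frozenset({"b1", "b2"}): a1 | a2,
    }
    dist = {}
    for choice, p in zip(selector_choices, probabilities):
        allowed = target[frozenset(choice)]
        dist[allowed] = dist.get(allowed, Fraction(0)) + p
    return dist
```

With the probabilities 5/10, 3/10, 1/10, 1/10 for $M = 10$ and selector choices {b1}, {b2}, {b1, b2}, {b2}, the set $A_1$ is allowed with probability 1/2, $A_2$ with probability 2/5 and $A_1 \cup A_2$ with probability 1/10.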


Our next goal is to show that, by varying the granularity $M$, we can get arbitrarily close
to the optimal penalty for a randomised multi-strategy and, for the case of static penalties,
define a suitable choice of $M$. This will be formalised in Theorem 4.7 shortly. First, we need
to establish the following intermediate result, stating that, in the static case, in addition to
Theorem 4.4, we can require the action subsets allowed by a multi-strategy to be ordered
with respect to the subset relation.


Theorem 4.6. Let $G$ be a game, $\phi = R^{r}_{\geq b}[C]$ a property and $(\psi, \mathit{sta})$ a static penalty scheme.
For any sound multi-strategy $\theta$ we can construct another sound multi-strategy $\theta'$ such that
$\mathit{pen}_{\mathit{sta}}(\psi, \theta) \geq \mathit{pen}_{\mathit{sta}}(\psi, \theta')$ and for each $s \in S_\Diamond$, if $\mathit{supp}(\theta'(s)) = \{B, C\}$, then either $B \subseteq C$
or $C \subseteq B$.


Proof. Let $\theta$ be a multi-strategy and fix $s_1$ such that $\theta$ allows two different action sets $B$ and $C$
with probability $p \in (0,1)$ and $1 - p$ where $B \not\subseteq C$ and $C \not\subseteq B$. If $\inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s_1}(r) = 0$, then
we can in fact allow deterministically the single set $A(s_1)$ and we are done. Hence, suppose
that the reward accumulated from $s_1$ is non-zero.

Suppose, w.l.o.g., that:

$\min_{a \in B} \big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s'}(r) \big) \leq \min_{a \in C} \big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot \inf_{\sigma \triangleleft \theta,\pi} E^{\sigma,\pi}_{G,s'}(r) \big)$    (4.16)

It must be the case that, for some $D \in \{B, C\}$, we have:

$\min_{a \in D} \big( r(s_1,a) + \inf_{\sigma \triangleleft \theta,\pi} \sum_{s' \in S} \delta(s_1,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r{\downarrow}s_1) \big) > 0$    (4.17)

(otherwise the minimal reward accumulated from $s_1$ is 0 since there is a compliant strategy
that keeps returning to $s_1$ without ever accumulating any reward), and if the inequality in
(4.16) is strict, then (4.17) holds for $D = C$. W.l.o.g., suppose that the above property holds
for $C$. We define $\theta'$ by modifying $\theta$ and picking $B \cup C$ with probability $p$, $C$ with $(1 - p)$,
and $B$ with probability 0.

Under $\theta$, the minimal reward achievable by some compliant strategy is given as the
least solution to the following equations [25, Theorem 7.3.3] (as before, the notation of
[25] requires “negative” models):


x s = 


9(s)(A) ■ min r(s, a) + 5(s, a) (s') ■ x s i 


A£supp(0(s)) 


for s G 5, 


s'es 


x s = min ) r(s, a) + S(s, a)(s') ■ x s 


o 


for sG5q 


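The minimal rewards $x'_s$ achievable under $\theta'$ are defined analogously. In particular, for the
equation with $s_1$ on the left-hand side we have:

$$x_{s_1} = p \cdot \min_{a \in B} \Big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot x_{s'} \Big) + (1-p) \cdot \min_{a \in C} \Big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot x_{s'} \Big)$$

$$x'_{s_1} = p \cdot \min_{a \in B \cup C} \Big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot x'_{s'} \Big) + (1-p) \cdot \min_{a \in C} \Big( r(s_1,a) + \sum_{s' \in S} \delta(s_1,a)(s') \cdot x'_{s'} \Big)$$
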


We show that the least solution $\bar{x}$ to the equations for $x$ is also the least solution to the equations for $x'$.

First, note that $\bar{x}$ is clearly a solution to any equation with $s \neq s_1$ on the left-hand side
since these equations remain unchanged in both sets of equations. As for the equation with $s_1$,
we have $\min_{a \in B} \big( r(s_1,a) + \sum_{s'} \delta(s_1,a)(s') \cdot \bar{x}_{s'} \big) \leq \min_{a \in C} \big( r(s_1,a) + \sum_{s'} \delta(s_1,a)(s') \cdot \bar{x}_{s'} \big)$, and so
necessarily $\min_{a \in B} \big( r(s_1,a) + \sum_{s'} \delta(s_1,a)(s') \cdot \bar{x}_{s'} \big) = \min_{a \in B \cup C} \big( r(s_1,a) + \sum_{s'} \delta(s_1,a)(s') \cdot \bar{x}_{s'} \big)$.

To see that $\bar{x}$ is the least solution to the equations for $x'$, we show that (i) for all $s$, if
$\inf_{\sigma \triangleleft \theta', \pi} E^{\sigma,\pi}_{G,s}(r) = 0$ then $\bar{x}_s = 0$; and (ii) there is a unique fixpoint satisfying $x_s = 0$ for all $s$ such that
$\inf_{\sigma \triangleleft \theta', \pi} E^{\sigma,\pi}_{G,s}(r) = 0$.

For (i), suppose $\bar{x}_s > 0$. Let $\sigma'$ be a strategy compliant with $\theta'$, and $\pi$ an arbitrary
strategy. Suppose $Pr^{\sigma',\pi}_{G,s}(F\, s_1) = 0$; then there is a strategy $\sigma$ compliant with $\theta$ which
behaves exactly as $\sigma'$ when starting from $s$, and by our assumption on the properties of $\bar{x}_s$
we get that $E^{\sigma,\pi}_{G,s}(r) > 0$ and so $E^{\sigma',\pi}_{G,s}(r) > 0$. Now suppose that $Pr^{\sigma',\pi}_{G,s}(F\, s_1) > 0$. For this
case, by condition (4.17), the fact that it holds for $D = C$ and by defining $\theta'$ so that it picks
$C$ with non-zero probability, we get that the reward under any strategy compliant with $\theta'$ is
non-zero when starting in $s_1$, and so $E^{\sigma',\pi}_{G,s}(r) > 0$.

Point (ii) can be obtained by an application of [25, Proposition 7.3.4]. □
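
The effect of the construction in the proof above can be checked numerically on a toy instance. The following is a minimal sketch, not the authors' implementation: the tiny game, the state and action names (`s1`, `end`, `a`, `b`, `stay`) and the helper names `min_reward` and `static_penalty` are all assumptions made for illustration. It iterates the equations for $x_s$ from zero to approximate their least solution and compares $\theta$ (sets $B$ and $C$) with $\theta'$ (sets $B \cup C$ and $C$).

```python
def min_reward(controller, environment, delta, reward, theta, tol=1e-12, max_iter=100_000):
    """Approximate the least solution of the equations for x_s by iterating from 0."""
    states = list(controller) + list(environment)
    x = {s: 0.0 for s in states}

    def q(s, a):
        # One-step reward plus expected value of the successor states.
        return reward[(s, a)] + sum(p * x[t] for t, p in delta[(s, a)].items())

    for _ in range(max_iter):
        new = {}
        for s in controller:
            # Controller states: average over allowed sets, worst case inside each set.
            new[s] = sum(prob * min(q(s, a) for a in allowed)
                         for allowed, prob in theta[s].items())
        for s in environment:
            new[s] = min(q(s, a) for a in environment[s])
        if max(abs(new[s] - x[s]) for s in states) < tol:
            return new
        x = new
    return x


def static_penalty(dist, psi):
    """Local static penalty: probability-weighted penalty of the blocked actions."""
    return sum(prob * sum(p for a, p in psi.items() if a not in allowed)
               for allowed, prob in dist.items())


# Hypothetical game: one controller state s1 and an absorbing state "end".
controller = {"s1": ["a", "b"]}
environment = {"end": ["stay"]}
delta = {("s1", "a"): {"end": 1.0}, ("s1", "b"): {"end": 1.0}, ("end", "stay"): {"end": 1.0}}
reward = {("s1", "a"): 1.0, ("s1", "b"): 2.0, ("end", "stay"): 0.0}
psi = {"a": 1.0, "b": 1.0}  # unit static penalties (assumed)

B, C = frozenset({"a"}), frozenset({"b"})
theta = {"s1": {B: 0.5, C: 0.5}}
theta_prime = {"s1": {B | C: 0.5, C: 0.5}}

x = min_reward(controller, environment, delta, reward, theta)
x_prime = min_reward(controller, environment, delta, reward, theta_prime)
pen = static_penalty(theta["s1"], psi)
pen_prime = static_penalty(theta_prime["s1"], psi)
```

In this instance the minimal reward from `s1` is unchanged while the static penalty drops, which is the behaviour the theorem describes for the case where the minimum over $B$ does not exceed the minimum over $C$.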


We can now return to the issue of how to vary the granularity M to get sufficiently 
close to the optimal penalty. We formalise this as follows. 


Theorem 4.7. Let $G$ be a game, $\phi = R^r_{\geq b}[C]$ a property, and $(\psi, t)$ a (static or dynamic)
penalty scheme. Let $\theta$ be a sound multi-strategy. For any $\varepsilon > 0$, there is an $M$ and a sound
multi-strategy $\theta'$ of granularity $M$ satisfying $\mathit{pen}_t(\psi,\theta') - \mathit{pen}_t(\psi,\theta) \leq \varepsilon$. Moreover, for
static penalties it suffices to take $M = \big\lceil \sum_{s \in S, a \in A(s)} \psi(s,a) / \varepsilon \big\rceil$.

Proof. By Theorem 4.4, we can assume $\mathit{supp}(\theta(t)) = \{A_1, A_2\}$ for any state $t \in S_\Diamond$.

We deal with the cases of static and dynamic penalties separately. For static penalties,
let $t \in S_\Diamond$ and $\theta(t)(A_1) = q$, $\theta(t)(A_2) = 1 - q$ for $A_1 \subseteq A_2 \subseteq A(t)$. Modify $\theta$ by rounding
$q$ up to the number $q'$ which is the nearest multiple of $\frac{1}{M}$. The resulting multi-strategy
$\theta'$ is again sound, since any strategy compliant with $\theta'$ is also compliant with $\theta$: the
witnessing distributions (see Definition 3.2) $d_{t,A_1}$ and $d_{t,A_2}$ for $\theta$ are obtained from the
distributions $d'_{t,A_1}$ and $d'_{t,A_2}$ for $\theta'$ by setting $d_{t,A_1}(a) = d'_{t,A_1}(a)$ for all $a \in A_1$ and
$d_{t,A_2}(a) = \frac{q'-q}{1-q} \cdot d'_{t,A_1}(a) + \frac{1-q'}{1-q} \cdot d'_{t,A_2}(a)$ for all $a \in A_2$; note that both $d_{t,A_1}$ and $d_{t,A_2}$ are indeed
probability distributions. Further, the penalty in $\theta'$ changes by at most $\frac{1}{M} \sum_{a \in A(t)} \psi(t,a)$.
To obtain the result we repeat the above for all $t$.

Now let us consider dynamic penalties. Intuitively, the claim follows since by making 
small changes to the multi-strategy, while not (dis)allowing any new actions, we only cause 
small changes to the reward and penalty. 

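Let $\theta$ be a multi-strategy and $t \in S_\Diamond$ a state. W.l.o.g., suppose:

$$\inf_{\sigma \triangleleft \theta, \pi} \min_{a \in A_1} \Big( r(t,a) + \sum_{s'} \delta(t,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r) \Big) \;\geq\; \inf_{\sigma \triangleleft \theta, \pi} \min_{a \in A_2} \Big( r(t,a) + \sum_{s'} \delta(t,a)(s') \cdot E^{\sigma,\pi}_{G,s'}(r) \Big)$$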

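For $0 < x < \theta(t)(A_2)$, we define a multi-strategy $\theta_x$ by $\theta_x(t)(A_1) = \theta(t)(A_1) + x$ and
$\theta_x(t)(A_2) = \theta(t)(A_2) - x$, and $\theta_x(s) = \theta(s)$ for all $s \neq t$. We will show that
$\inf_{\sigma \triangleleft \theta_x, \pi} E^{\sigma,\pi}_{G,s}(r) \geq \inf_{\sigma \triangleleft \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$.
Consider the following functional $F_x : (S \to \mathbb{R}) \to (S \to \mathbb{R})$, constructed
for the multi-strategy $\theta_x$:

$$F_x(f)(s) = \sum_{A \in \mathit{supp}(\theta_x(s))} \theta_x(s)(A) \cdot \min_{a \in A} \Big( r(s,a) + \sum_{s' \in S} \delta(s,a)(s') \cdot f(s') \Big) \qquad \text{for } s \in S_\Diamond$$

$$F_x(f)(s) = \min_{a \in A(s)} \Big( r(s,a) + \sum_{s' \in S} \delta(s,a)(s') \cdot f(s') \Big) \qquad \text{for } s \in S_\Box$$
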


Let $f$ be the function assigning $\inf_{\sigma \triangleleft \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$ to $s$. Observe that $f(s) = 0$ whenever
$\inf_{\sigma \triangleleft \theta_x, \pi} E^{\sigma,\pi}_{G,s}(r) = 0$; this follows since $x < \min\{\theta(t)(A_1), \theta(t)(A_2)\}$ and so both $\theta$ and $\theta_x$
allow the same actions with non-zero probability. Also, $F_x(f)(s) \geq f(s)$: for $s \neq t$ in
fact $F_x(f)(s) = f(s)$ because the corresponding functional $F$ for $\theta$ coincides with $F_x$ on
$s$; for $s = t$, we have $F_x(f)(s) \geq f(s)$ since
$\min_{a \in A_1} \big( r(s,a) + \sum_{s'} \delta(s,a)(s') \cdot f(s') \big) \geq \min_{a \in A_2} \big( r(s,a) + \sum_{s'} \delta(s,a)(s') \cdot f(s') \big)$
by the properties of $A_1$ and $A_2$ and since $x$ is non-negative. Hence, we can apply [25, Proposition 7.3.4] and obtain that $\theta_x$ ensures at least
the same reward as $\theta$. Thus, by increasing the probability of allowing $A_1$ in $t$ the soundness
of the multi-strategy is preserved.

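Further, for any strategy $\sigma'$ compliant with $\theta_x$ and any $\pi$, the penalty when starting in
$t$, i.e. $E^{\sigma',\pi}_{G,t}(\psi^{\theta_x}_{rew})$, is equal to:

$$\mathit{pen}_{loc}(\psi,\theta_x,t) + \sum_{a \in A} \xi'(t,a) \sum_{t' \in S} \delta(t,a)(t') \cdot \Big( E^{\sigma',\pi}_{G,t'}(\psi^{\theta_x}_{rew}{\downarrow}t) + Pr^{\sigma',\pi}_{G,t'}(F\, t) \cdot E^{\sigma',\pi}_{G,t}(\psi^{\theta_x}_{rew}) \Big)$$

for $\xi' = \sigma' \cup \pi$. There is a strategy $\sigma$ compliant with $\theta$ which differs from $\sigma'$ only on $t$, where
$\sum_{a \in A} |\sigma'(t,a) - \sigma(t,a)| \leq x$. We have, for any $\pi$:

$$E^{\sigma,\pi}_{G,t}(\psi^{\theta}_{rew}) = \mathit{pen}_{loc}(\psi,\theta,t) + \sum_{a \in A} \xi(t,a) \sum_{s \in S} \delta(t,a)(s) \cdot \Big( E^{\sigma,\pi}_{G,s}(\psi^{\theta}_{rew}{\downarrow}t) + Pr^{\sigma,\pi}_{G,s}(F\, t) \cdot E^{\sigma,\pi}_{G,t}(\psi^{\theta}_{rew}) \Big)$$

$$\geq \mathit{pen}_{loc}(\psi,\theta_x,t) - x + \sum_{a \in A} (\xi'(t,a) - x) \sum_{s \in S} \delta(t,a)(s) \cdot \Big( E^{\sigma',\pi}_{G,s}(\psi^{\theta_x}_{rew}{\downarrow}t) + Pr^{\sigma',\pi}_{G,s}(F\, t) \cdot E^{\sigma,\pi}_{G,t}(\psi^{\theta}_{rew}) \Big)$$

where $\xi = \sigma \cup \pi$ and the rest is as above.

Thus:

$$E^{\sigma',\pi}_{G,t}(\psi^{\theta_x}_{rew}) = \frac{\mathit{pen}_{loc}(\psi,\theta_x,t) + \sum_{a \in A} \xi'(t)(a) \sum_{t' \in S} \delta(t,a)(t') \cdot E^{\sigma',\pi}_{G,t'}(\psi^{\theta_x}_{rew}{\downarrow}t)}{1 - \sum_{a \in A} \xi'(t)(a) \sum_{t' \in S} \delta(t,a)(t') \cdot Pr^{\sigma',\pi}_{G,t'}(F\, t)}$$

$$E^{\sigma,\pi}_{G,t}(\psi^{\theta}_{rew}) \geq \frac{\mathit{pen}_{loc}(\psi,\theta_x,t) - x + \sum_{a \in A} (\xi'(t)(a) - x) \sum_{t' \in S} \delta(t,a)(t') \cdot E^{\sigma',\pi}_{G,t'}(\psi^{\theta_x}_{rew}{\downarrow}t)}{1 - \sum_{a \in A} (\xi'(t)(a) - x) \sum_{t' \in S} \delta(t,a)(t') \cdot Pr^{\sigma',\pi}_{G,t'}(F\, t)}$$

and so $E^{\sigma',\pi}_{G,t}(\psi^{\theta_x}_{rew}) - E^{\sigma,\pi}_{G,t}(\psi^{\theta}_{rew})$ goes to 0 as $x$ goes to 0. Hence, $\mathit{pen}_{dyn}(\psi,\theta_x) - \mathit{pen}_{dyn}(\psi,\theta)$
goes to 0 as $x$ goes to 0.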

The above gives us that, for any error bound $\varepsilon$ and a fixed state $s$, there is an $x$ such
that we can modify the decision of $\theta$ in $s$ by $x$, not violate the soundness property and
increase the penalty by at most $\varepsilon/|S|$. We thus need to pick $M$ such that $1/M \leq x$. To
finish the proof, we repeat this procedure for every state $s$. □
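
The static-penalty part of the proof above can be illustrated with exact arithmetic. This is a minimal sketch under assumptions: the single state `t`, its actions `a` and `b`, the penalty values and the helper names (`granularity`, `round_up`, `local_static_penalty`) are hypothetical, and the local penalty is computed as the probability-weighted penalty of the blocked actions.

```python
from fractions import Fraction
from math import ceil


def granularity(penalties, eps):
    """M = ceil(sum of all static penalties / eps), as in the statement of the theorem."""
    total = sum(sum(acts.values()) for acts in penalties.values())
    return max(1, ceil(total / eps))


def round_up(q, M):
    """Smallest multiple of 1/M that is greater than or equal to q."""
    return Fraction(ceil(q * M), M)


def local_static_penalty(psi_t, dist):
    """Sum over allowed sets of their probability times the penalty of the actions they block."""
    return sum(prob * sum(p for a, p in psi_t.items() if a not in allowed)
               for allowed, prob in dist.items())


# Hypothetical instance: one controller state t with actions a and b.
penalties = {"t": {"a": Fraction(1), "b": Fraction(2)}}
eps = Fraction(1, 10)
M = granularity(penalties, eps)

A1, A2 = frozenset({"a"}), frozenset({"a", "b"})   # A1 is a subset of A2
q = Fraction(1, 7)                                 # theta(t)(A1)
q_rounded = round_up(q, M)                         # theta'(t)(A1)

before = local_static_penalty(penalties["t"], {A1: q, A2: 1 - q})
after = local_static_penalty(penalties["t"], {A1: q_rounded, A2: 1 - q_rounded})
increase = after - before
bound = Fraction(1, M) * sum(penalties["t"].values())
```

Here the penalty increase caused by rounding stays below both the per-state bound $\frac{1}{M}\sum_{a \in A(t)} \psi(t,a)$ and the error bound $\varepsilon$.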


For the sake of completeness, we also show that Theorem 4.6 does not extend to dynamic
penalties. This is because, in this case, increasing the probability of allowing an action can
lead to an increased penalty if one of the successor states has a high expected penalty. An
example is shown in Fig. 5, for which we want to reach the goal state $s_3$ with probability at
least 0.5.



This implies $\theta(s_0)(\{b\}) \cdot \theta(s_1)(\{d\}) \geq 0.5$, and so $\theta(s_0)(\{b\}) > 0$, $\theta(s_1)(\{d\}) > 0$. If $\theta$ satisfies the
condition of Theorem 4.6 then $\theta(s_0)(\{c\}) = \theta(s_1)(\{e\}) = 0$, so an opponent can always use $b$,
forcing an expected penalty of $\theta(s_0)(\{b\}) + \theta(s_1)(\{d\})$, for a minimal value of $\sqrt{2}$. However,
the sound multi-strategy $\theta$ with $\theta(s_0)(\{b\}) = \theta(s_0)(\{c\}) = 0.5$ and $\theta(s_1)(\{d\}) = 1$ achieves a
dynamic penalty of just 1.
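
The value $\sqrt{2}$ quoted above is the minimum of $\theta(s_0)(\{b\}) + \theta(s_1)(\{d\})$ subject to the product of the two probabilities being at least 0.5. A short numerical sanity check follows; it is only a sketch, with the grid resolution and the helper name `nested_penalty` chosen for illustration.

```python
from math import isclose, sqrt


def nested_penalty(a: float) -> float:
    """Penalty a + b where a = theta(s0)({b}) and b = 0.5 / a is the smallest
    value of theta(s1)({d}) that still satisfies a * b >= 0.5."""
    return a + 0.5 / a


# a must lie in [0.5, 1] so that 0.5 / a is a valid probability.
grid = [0.5 + 0.5 * k / 100_000 for k in range(100_001)]
best = min(nested_penalty(a) for a in grid)
```

The grid minimum agrees with $\sqrt{2} \approx 1.414$, which exceeds the dynamic penalty of 1 achieved by the multi-strategy that is not of the nested form.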

4.4. Optimisations. We conclude this section on MILP-based multi-strategy synthesis by 
presenting some optimisations that can be applied to our methods. The general idea is to 
add additional constraints to the MILP problems that will reduce the search space to be 
explored by a solver. We present two different optimisations, targeting different aspects of 
our encodings: (i) possible variable values; and (ii) penalty structures. 

Bounds on variable values. In our encodings, for the variables x s , we only specified very 
general lower and upper bounds that would constrain their value. Narrowing down the set 
of values that a variable may take can significantly reduce the search space and thus the 
solution time required by an MILP solver. One possibility that works well in practice is to 
bound the values of the variables by the minimal and maximal expected reward achievable 
from the given state, i.e., add the constraints: 

$$\inf_{\sigma,\pi} E^{\sigma,\pi}_{G,s}(r) \;\leq\; x_s \;\leq\; \sup_{\sigma,\pi} E^{\sigma,\pi}_{G,s}(r) \qquad \text{for all } s \in S$$

where both the infima and suprema above are constants obtained by applying standard 
probabilistic model checking algorithms. 

Actions with zero penalties. Our second optimisation exploits the case where an action 
has zero penalty assigned to it. Intuitively, this action could always be disabled without 
harming the overall penalty of the multi-strategy. On the other hand, enabling an action 
with zero penalty might be the only way to satisfy the property and therefore we cannot 
disable all such actions. However, it is enough to allow at most one action that has zero 
penalty. For simplicity of the presentation, we let $Z_s = \{a \in A(s) \mid \psi(s,a) = 0\}$; then
formally we add the constraints $\sum_{a \in Z_s} y_{s,a} \leq 1$ for all $s \in S_\Diamond$.
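
As an illustration of the zero-penalty optimisation, the sketch below collects, for each controller state, the binary variables that would appear in the added constraint. It is not tied to any particular MILP solver API: the variable naming scheme `y_<state>_<action>` and the function name `zero_penalty_constraints` are assumptions made for this example.

```python
def zero_penalty_constraints(psi):
    """Map each controller state s to the variables y_{s,a} with a in Z_s.

    Each returned list stands for the constraint "sum of these variables <= 1".
    States with fewer than two zero-penalty actions are skipped, because the
    constraint is then trivially satisfied by binary variables.
    """
    constraints = {}
    for s, acts in psi.items():
        zero = sorted(a for a, penalty in acts.items() if penalty == 0)
        if len(zero) > 1:
            constraints[s] = [f"y_{s}_{a}" for a in zero]
    return constraints


# Hypothetical penalties psi(s, a) for two controller states.
psi = {
    "s0": {"a": 0, "b": 0, "c": 1},
    "s1": {"d": 2, "e": 0},
}
rows = zero_penalty_constraints(psi)
```

Each entry of `rows` would be passed to the solver as one linear constraint with coefficient 1 on every listed variable and right-hand side 1.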

5. Experimental Results 

We have implemented our techniques within PRISM-games [9], an extension of the PRISM
model checker [21] for performing model checking and strategy synthesis on stochastic
games. PRISM-games can thus already be used for (classical) controller synthesis problems
on stochastic games. To this, we add the ability to synthesise multi-strategies using the
MILP-based method described in Section 4. Our implementation currently uses CPLEX [32]
or Gurobi [33] to solve MILP problems. We investigated the applicability and performance
of our approach on a variety of case studies, some of which are existing benchmark examples
and some of which were developed for this work. These are described in detail below and
the files used can be found online [53]. Our experiments were run on a PC with a 2.8GHz
Xeon processor and 32GB of RAM, running Fedora 14.

5.1. Deterministic Multi-strategy Synthesis. We first discuss the generation of optimal
deterministic multi-strategies, the results of which are presented in Tabs. 1 and 2. Tab. 1
summarises the models and properties considered. For each model, we give: the case study
name, any parameters used, the number of states ($|S|$) and of controller states ($|S_\Diamond|$), and
the property used. The final column gives, for comparison purposes, the time required for
performing classical (single) strategy synthesis on each model and property ($\phi$).

| Name [parameters]      | Parameter values | States | Controller states | Property                              | Strat. synth. time (s) |
|------------------------|------------------|--------|-------------------|---------------------------------------|------------------------|
| cloud [vm]             | 5                | 8,841  | 2,177             | $P_{\geq 0.9999}[F\ deployed]$        | 0.04                   |
|                        | 6                | 34,953 | 8,705             | $P_{\geq 0.999}[F\ deployed]$         | 0.10                   |
| android [r, s]         | 1, 48            | 2,305  | 997               | $R^{time}_{\leq 10000}[C]$            | 0.16                   |
|                        | 2, 48            | 9,100  | 3,718             | $R^{time}_{\leq 10000}[C]$            | 0.57                   |
|                        | 3, 48            | 23,137 | 9,025             | $R^{time}_{\leq 10000}[C]$            | 0.93                   |
| mdsm [N]               | 3                | 62,245 | 9,173             | $P_{\leq 0.1}[F\ deviated]$           | 5.64                   |
|                        | 3                | 62,245 | 9,173             | $P_{\leq 0.01}[F\ deviated]$          |                        |
| investor [vinit, vmax] | 5, 10            | 10,868 | 3,344             | $R^{profit}_{\geq 4.98}[C]$           | 0.87                   |
|                        | 10, 15           | 21,593 | 6,644             | $R^{profit}_{\geq 8.99}[C]$           | 2.20                   |
| team-form [N]          | 3                | 12,476 | 2,023             | $P_{\geq 0.9999}[F\ done_1]$          | 0.20                   |
|                        | 4                | 96,666 | 13,793            | $P_{\geq 0.9999}[F\ done_1]$          | 0.83                   |
| cdmsn [N]              | 3                | 1,240  | 604               | $P_{\geq 0.9999}[F\ prefer_1]$        | 0.05                   |

Table 1: Details of the models and properties used for experiments with deterministic
multi-strategies, and execution times for single strategy synthesis.


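| Name [parameters]      | Parameter values | Penalty | CPLEX: No-opts (s) | CPLEX: Bounds (s) | CPLEX: Zero (s) | CPLEX: Both (s) | Gurobi: Both (s) |
|------------------------|------------------|---------|--------------------|-------------------|-----------------|-----------------|------------------|
| cloud [vm]             | 5                | 0.001   | 2.5                | 3.35              | 13.04           | 10.36           | 1.45             |
|                        | 6                | 0.01    | 62.45              | *                 | 63.59           | 25.37           | 4.73             |
| android [r, s]         | 1, 48            | 0.0009  | 1.07               | 0.66              | 1.04            | 0.48            | 0.53             |
|                        | 2, 48            | 0.0011  | 28.56              | 8.41              | 28.48           | 8.42            | 3.6              |
|                        | 3, 48            | 0.0013  | *                  | 13.41             | *               | 13.30           | 47.62            |
| mdsm [N]               | 3                | 52      | 28.06              | 36.28             | 27.88           | 33.72           | 19.40            |
|                        | 3                | 186     | 11.89              | 11.57             | 11.88           | 11.56           | 12.27            |
| investor [vinit, vmax] | 5, 10            | 1       | 68.64              | 131.38            | 68.90           | 131.36          | 12.02            |
|                        | 10, 15           | 1       | *                  | *                 | *               | *               | 208.95           |
| team-form [N]          | 3                | 0.890   | 0.15               | 0.26              | 0.15            | 0.26            | 0.80             |
|                        | 4                | 0.890   | 249.36             | 249.49            | 186.41          | 184.50          | 3.84             |
| cdmsn [N]              | 3                | 2       | 0.57               | 0.62              | 0.62            | 0.61            | 1.65             |

* No optimal solution to MILP problem within 5 minute time-out.

Table 2: Experimental results for synthesising optimal deterministic multi-strategies
(multi-strategy synthesis times in seconds).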


In Tab. 2 we show, for each different model, the penalty value of the optimal multi-strategy
and the time to generate it. We report several different times, each for different
combinations of the optimisations described in Section 4.4 (either no optimisations, one or
both). For the last result, we give times for both MILP solvers: CPLEX and Gurobi.


Case studies. Now, we move on to give further details for each case study, illustrating 
the variety of ways that permissive controller synthesis can be used. Subsequently, we will 
discuss the performance and scalability of our approach.

cloud: We adapt a PRISM model from [6] to synthesise deployment of services across
virtual machines (VMs) in a cloud infrastructure. Our property $\phi$ specifies that, with high
probability, services are deployed to a preferred subset of VMs, and we then assign unit
(dynamic) penalties to all actions corresponding to deployment on this subset. The resulting
multi-strategy has very low expected penalty (see Tab. 2) indicating that the goal $\phi$ can be
achieved whilst the controller experiences reduced flexibility only on executions with low
probability.

android: We apply permissive controller synthesis to a model created for runtime control of
an Android application that provides real-time stock monitoring. We extend the application
to use multiple data sources and synthesise a multi-strategy which specifies an efficient
runtime selection of data sources ($\phi$ bounds the total expected response time). We use static
penalties, assigning higher values to actions that select the two most efficient data sources
at each time point and synthesise a multi-strategy that always provides a choice of at least
two sources (in case one becomes unavailable), while preserving $\phi$.

mdsm: Microgrid demand-side management (MDSM) is a randomised scheme for managing
local energy usage. A stochastic game analysis [8] previously showed it is beneficial for users
to selfishly deviate from the protocol, by ignoring a random back-off mechanism designed to
reduce load at busy times. We synthesise a multi-strategy for a (potentially selfish) user, with
the goal ($\phi$) of bounding the probability of deviation (at either 0.1 or 0.01). The resulting
multi-strategy could be used to modify the protocol, restricting the behaviour of this user to
reduce selfish behaviour. To make the multi-strategy as permissive as possible, restrictions
are only introduced where necessary to ensure $\phi$. We also guide where restrictions are made
by assigning (static) penalties at certain times of the day.

investor: This example [23] synthesises strategies for a futures market investor, who chooses
when to reserve shares, operating in a (malicious) market which can periodically ban him/her
from investing. We generate a multi-strategy that achieves 90% of the maximum expected
profit (obtainable by a single strategy) and assign (static) unit penalties to all actions,
showing that, after an immediate share purchase, the investor can choose his/her actions
freely and still meet the 90% target.

team-form: This example m synthesises strategies for forming teams of agents in order to 
complete a set of collaborative tasks. Our goal ($\phi$) is to guarantee that a particular task is
completed with high probability (0.9999). We use (dynamic) unit penalties on all actions 
of the first agent and synthesise a multi-strategy representing several possibilities for this 
agent while still achieving the goal. 

cdmsn: Lastly, we apply permissive controller synthesis to a model of a protocol for collective 
decision making in sensor networks (CDMSN) [8]. We synthesise strategies for nodes in the 
network such that consensus is achieved with high probability (0.9999). We use (static) 
penalties inversely proportional to the energy associated with each action a node can perform 
to ensure that the multi-strategy favours more efficient solutions. 

Performance and scalability. Unsurprisingly, permissive controller synthesis is more
costly to execute than classical controller synthesis - this is clearly seen by comparing the
times in the rightmost column of Tab. 1 with the times in Tab. 2. However, we successfully
synthesised deterministic multi-strategies for a wide range of models and properties, with
model sizes ranging up to approximately 100,000 states. The performance and scalability of
our method is affected (as usual) by the state space size. In particular, it is also affected
by the number of actions in controller states, since these result in integer MILP variables,
which are the most expensive part of the solution.

| Name [parameters]      | Parameter values | States | Controller states | Property                         |
|------------------------|------------------|--------|-------------------|----------------------------------|
| android [r, s]         | 1, 1             | 49     | 10                | $P_{\geq 0.9999}[F\ done]$       |
|                        | 1, 10            | 481    | 112               | $P_{\geq 0.999}[F\ done]$        |
| cloud [vm]             | 5                | 8,841  | 2,177             | $P_{\geq 0.9999}[F\ deployed]$   |
| investor [vinit, vmax] | 5, 10            | 10,868 | 3,344             | $R^{profit}_{\geq 4.98}[C]$      |
| team-form [N]          | 3                | 12,476 | 2,023             | $P_{\geq 0.9999}[F\ done_1]$     |
| cdmsn [N]              | 3                | 1,240  | 604               | $P_{\geq 0.9999}[F\ prefer_1]$   |

Table 3: Details of models and properties for approximating optimal randomised
multi-strategies.

The performance optimisations presented in Section 4.4 often allowed us to obtain
an optimal multi-strategy more quickly. In many cases, it proved beneficial to apply both
optimisations at the same time. In the best case (android, r=3, s=48), an order of magnitude
gain was observed. We observed a slowdown after applying the optimisations in the case of
investor. We attribute this to the fact that adding bounds on variable values can make
finding the initial solution of the MILP problem harder, causing a slowdown of the overall
solution process.

Both solvers were run using their off-the-shelf configuration and Gurobi proved to be 
a more efficient solver. In the case of CPLEX, we also observed worse numerical stability, 
causing it to return a sub-optimal multi-strategy as optimal. In the case of Gurobi, we did 
not see any such behaviour. 


5.2. Randomised multi-strategy synthesis. Next, we report the results for approximating
optimal randomised multi-strategies. Tab. 3 summarises the models and properties used.
In Tab. 4 we report the effects on state space size caused by adding the approximation
gadgets to the games. We picked three different granularities M = 100, M = 200 and
M = 300; for higher values of M we did not observe improvements in the penalties of the
generated multi-strategies. Finally, in Tab. 5 we show the penalties obtained by the randomised
multi-strategies that were generated. We compare the (static) penalty value of the
randomised multi-strategies to the value obtained by optimal deterministic multi-strategies
for the same models. The MILP encodings for randomised multi-strategies are larger than
deterministic ones and thus slower to solve, so we impose a time-out of 5 minutes. We used
Gurobi as the MILP solver in every case.

We are able to generate a sound multi-strategy for all the examples; in some cases it is
optimally permissive, in others it is not (denoted by a * in Tab. 5). As would be expected,
larger values of M often give smaller penalties. In some cases, this is not true, which we
attribute to the size of the MILP problem (which grows with M). For all examples, we built
randomised multi-strategies with penalties smaller than or equal to those of the deterministic ones.

| Name [parameters]      | Parameter values | States | Controller states | States (M=100) | States (M=200) | States (M=300) |
|------------------------|------------------|--------|-------------------|----------------|----------------|----------------|
| android [r, s]         | 1, 1             | 49     | 10                | 90             | 94             | 98             |
|                        | 1, 10            | 481    | 112               | 1,629          | 1,741          | 1,853          |
| cloud [vm]             | 5                | 8,841  | 2,177             | 29,447         | 32,686         | 35,233         |
| investor [vinit, vmax] | 5, 10            | 10,868 | 3,344             | 33,440         | 35,948         | 38,456         |
| team-form [N]          | 3                | 12,476 | 2,023             | 31,928         | 33,716         | 35,504         |
| cdmsn [N]              | 3                | 1,240  | 604               | 3,625          | 3,890          | 4,155          |

Table 4: State space growth for approximating optimal randomised multi-strategies.


| Name [parameters]      | Parameter values | States | Controller states | Pen. (det.) | Penalty (randomised), M=100 | M=200   | M=300   |
|------------------------|------------------|--------|-------------------|-------------|-----------------------------|---------|---------|
| android [r, s]         | 1, 1             | 49     | 10                | 1.01        | 0.91                        | 0.905   | 0.903   |
|                        | 1, 10            | 481    | 112               | 19.13       | 12.27*                      | 9.12*   | 17.18*  |
| cloud [vm]             | 5                | 8,841  | 2,177             | 1           | 0.91*                       | 0.905*  | 0.91*   |
| investor [vinit, vmax] | 5, 10            | 10,868 | 3,344             | 1           | 1*                          | 1*      | 1*      |
| team-form [N]          | 3                | 12,476 | 2,023             | 264         | 263.96*                     | 263.95* | 263.95* |
| cdmsn [N]              | 3                | 1,240  | 604               | 2           | 0.38*                       | 1.9*    | 0.5*    |

* Sound but possibly non-optimal multi-strategy obtained after 5 minute MILP time-out.

Table 5: Experimental results for approximating optimal randomised multi-strategies.

6. Conclusions 

We have presented a framework for permissive controller synthesis on stochastic two-player 
games, based on generation of multi-strategies that guarantee a specified objective and are 
optimally permissive with respect to a penalty function. We proved several key properties, 
developed MILP-based synthesis methods and evaluated them on a set of case studies. 

In this paper, we have imposed several restrictions on permissive controller synthesis. 
Firstly, we focused on properties expressed in terms of expected total reward (which also 
subsumes the case of probabilistic reachability). A possible topic for future work would be 
to consider more expressive temporal logics or parity objectives. The results might also 
be generalised so that both positive and negative rewards can be used, for example by 
using the techniques of [30]. We also restricted our attention to memoryless multi-strategies, 
rather than the more general class of history-dependent multi-strategies. Extending our 
theory to the latter case and exploring the additional power brought by history-dependent 
multi-strategies is another interesting direction of future work.

Appendix A. Appendix


A.1. Proof of Theorem 3.11 (NP-hardness). We start with the case of randomised
multi-strategies and static penalties which is the most delicate. Then we analyse the case 
of randomised multi-strategies and dynamic penalties, and finally show that this case can 
easily be modified for the remaining two combinations. 


Randomised multi-strategies and static penalties. We give a reduction from the 
Knapsack problem. Let n be the number of items, each of which can either be or not be 
put in a knapsack, let $v_i$ and $w_i$ be the value and the weight of item $i$, respectively, and let
$V$ and $W$ be the bounds on the value and weight of the items to be picked. We assume
that $v_i \leq 1$ for every $1 \leq i \leq n$, and that all numbers $v_i$ and $w_i$ are given as fractions with
denominator q. Let us construct the following MDP, where m is chosen such that 2~ m < ^ 
and 2~ m ■ W ^ K 



The rewards and penalties are as given by the overlined and underlined expressions, and 
set to 0 where not present. In particular, note that for any state s different from T the 
probability of reaching T from s is the same as the expected total reward from s. 

We show that there is a multi-strategy $\theta$ sound for the property $R^r_{\geq V/n}[C]$ such that
$\mathit{pen}_{sta}(\psi,\theta) \leq W + 2^{-m} \cdot W$ if and only if the answer to the Knapsack problem is "yes".

In the direction $\Leftarrow$, let $I \subseteq \{1, \ldots, n\}$ be the set of items put in the knapsack. It suffices
to define the multi-strategy $\theta$ by:

• $\theta(t'_i)(\{c_i, d_i\}) = 1 - 2^{-4m}$, $\theta(t'_i)(\{d_i\}) = 2^{-4m}$, $\theta(t_i)(\{a_i\}) = 1$ for $i \in I$,

• $\theta(t'_i)(\{c_i, d_i\}) = 1$, $\theta(t_i)(\{a_i, b_i\}) = 1$ for $i \notin I$.

In the direction $\Rightarrow$, let us have a multi-strategy $\theta$ satisfying the assumptions. Let
$P(s \to s')$ denote the lower bound on the probability of reaching $s'$ from $s$ under a strategy
which complies with the multi-strategy $\theta$. Denote by $I \subseteq \{1, \ldots, n\}$ the indices $i$ such that
$P(t_i \to T) \geq 2^{-m}$.

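Let $\beta_i = \theta(t_i)(\{a_i\})$ and $\alpha_i = \theta(t'_i)(\{d_i\})$. We will show that for any $i \in I$ we
have $\alpha_i \geq 2^{-m}(1 - \beta_i)$. When $\beta_i = 1$, this obviously holds. For $\beta_i < 1$, assume that
$\alpha_i < 2^{-m}(1 - \beta_i)$. Because the optimal strategy $\sigma$ will pick $b_i$ and $c_i$ whenever they are
available, we have:

$$P(t_i \to T) = \beta_i \cdot \sum_{j=0}^{\infty} \big( (1 - \alpha_i) \cdot \beta_i \big)^j \cdot \alpha_i \cdot v_i = \frac{\alpha_i \beta_i v_i}{1 - (1 - \alpha_i)\beta_i} = \frac{\alpha_i \beta_i v_i}{1 - \beta_i + \alpha_i \beta_i} \leq \frac{\alpha_i \beta_i}{1 - \beta_i + \alpha_i \beta_i}$$

$$< \frac{2^{-m}(1 - \beta_i)\beta_i}{1 - \beta_i + 2^{-m}(1 - \beta_i)\beta_i} = \frac{2^{-m}\beta_i}{1 + 2^{-m}\beta_i} < 2^{-m}$$

which is a contradiction with $i \in I$. Hence, $\alpha_i \geq 2^{-m}(1 - \beta_i)$ and so:

$$\mathit{pen}_{sta}(\psi,\theta,t_i) + \mathit{pen}_{sta}(\psi,\theta,t'_i) = \beta_i w_i + \alpha_i 2^{3m} w_i \geq \beta_i w_i + 2^{-m}(1 - \beta_i) 2^{3m} w_i \geq w_i$$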


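We have:

$$\sum_{i \in I} w_i \leq \sum_{i \in I} \big( \mathit{pen}_{sta}(\psi,\theta,t_i) + \mathit{pen}_{sta}(\psi,\theta,t'_i) \big) \leq W + 2^{-m} \cdot W$$

and, because $\sum_{i \in I} w_i$ and $W$ are fractions with denominator $q$, by the choice of $m$, we can
infer that $\sum_{i \in I} w_i \leq W$. Similarly:

$$\frac{1}{n} \sum_{i \in I} v_i \geq \frac{1}{n} \sum_{i \in I} P(t_i \to T) \geq \Big( \frac{1}{n} \sum_{i=1}^{n} P(t_i \to T) \Big) - 2^{-m} \geq \frac{1}{n} V - 2^{-m}$$

and again, because $\sum_{i \in I} v_i$ and $V$ are fractions with denominator $q$, by the choice of $m$ we
can infer that $\sum_{i \in I} v_i \geq V$. Hence, in the instance of the knapsack problem, it suffices to
pick exactly the items from $I$ to satisfy the restrictions.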


Randomised multi-strategies with dynamic penalties. The proof is analogous to the 
proof above, we only need to modify the MDP and the computations. For an instance of 
the Knapsack problem given as before, we construct the following MDP: 



We claim that there is a multi-strategy $\theta$ sound for the property $R^r_{\geq V/n}[C]$ such that
$\mathit{pen}_{dyn}(\psi,\theta) \leq \frac{1}{n} W$ if and only if the answer to the Knapsack problem is "yes".

In the direction $\Leftarrow$, for $I \subseteq \{1, \ldots, n\}$ the set of items in the knapsack, we define $\theta$ by
$\theta(t_i)(\{a_i\}) = 1$ for $i \in I$ and by allowing all actions in every other state.

In the direction $\Rightarrow$, let us have a multi-strategy $\theta$ satisfying the assumptions. Let
$P(s \to s')$ denote the lower bound on the probability of reaching $s'$ from $s$ under a strategy
which complies with the multi-strategy $\theta$. Denote by $I \subseteq \{1, \ldots, n\}$ the indices $i$ such that
$\theta(t_i)(\{a_i\}) > 0$. Observe that $P(t_i \to T) = v_i$ if $i \in I$ and $P(t_i \to T) = 0$ otherwise. Hence:

$$\frac{1}{n} \sum_{i \in I} v_i = \frac{1}{n} \sum_{i \in I} P(t_i \to T) = \frac{1}{n} \sum_{i=1}^{n} P(t_i \to T) \geq \frac{1}{n} V$$

and for the penalty, denoting $x_i := \theta(t_i)(\{a_i\})$, we get:

$$\frac{1}{n} W \geq \mathit{pen}_{dyn}(\psi,\theta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=0}^{\infty} (1 - x_i)^j x_i w_i = \frac{1}{n} \sum_{i \in I} \sum_{j=0}^{\infty} (1 - x_i)^j x_i w_i = \frac{1}{n} \sum_{i \in I} w_i \qquad (A.1)$$

because the strategy that maximises the penalty will pick $b_i$ whenever it is available. Hence,
in the instance of the knapsack problem, it suffices to pick exactly the items from $I$ to satisfy
the restrictions.
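
The geometric series used in (A.1) can be checked with exact rational arithmetic. The snippet below is a small sketch; the helper name `partial_penalty` and the sample values of $x_i$ and $w_i$ are assumptions chosen only for illustration. It shows that the partial sums of $\sum_j (1 - x_i)^j x_i w_i$ approach $w_i$ when $x_i > 0$ and stay at 0 when $x_i = 0$.

```python
from fractions import Fraction


def partial_penalty(x, w, terms):
    """Partial sum of (1 - x)^j * x * w for j = 0, ..., terms - 1."""
    return sum((1 - x) ** j * x * w for j in range(terms))


x_i, w_i = Fraction(1, 4), Fraction(3)
approx = partial_penalty(x_i, w_i, 50)
closed_form = w_i * (1 - (1 - x_i) ** 50)      # closed form of the partial sum
zero_case = partial_penalty(Fraction(0), w_i, 50)
```

The gap between `approx` and $w_i$ is $w_i (1 - x_i)^{50}$, which vanishes as the number of terms grows; this is why only the items with $x_i > 0$ contribute $w_i$ to the penalty.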

Deterministic multi-strategies and dynamic penalties. The proof is identical to 
the proof for randomised multi-strategies and dynamic penalties above: observe that the 
multi-strategy constructed there from an instance of Knapsack is in fact deterministic. 

Deterministic multi-strategies and static penalties. The proof is obtained by a small
modification of the proof for randomised multi-strategies and dynamic penalties above.
Instead of requiring $\mathit{pen}_{dyn}(\psi,\theta) \leq \frac{1}{n} W$ we require $\mathit{pen}_{sta}(\psi,\theta) \leq W$ and (A.1) changes to:

$$W \geq \mathit{pen}_{sta}(\psi,\theta) = \sum_{i=1}^{n} x_i w_i = \sum_{i \in I} w_i$$



A.2. Proof of Theorem 3.12 (Upper Bounds). We consider the two cases of deterministic
and randomised multi-strategies separately, showing that they are in NP and PSPACE,
respectively. To simplify readability, the proofs make use of constructions that appear later
in the main paper than Theorem 3.12. Note that those constructions do not build on the
theorem and so there is no cyclic dependency.

Deterministic multi-strategies. For deterministic multi-strategies, it suffices to observe
that the problem can be solved by verifying an MILP instance constructed in polynomial
time (and, additionally, in the case of dynamic penalties: a polynomial-time identification of
the infinite-penalty case - see Theorem 4.2 and Theorem 4.3). Since the problem of solving
an MILP instance is in NP, the result follows.

Randomised multi-strategies. We now show that the permissive controller synthesis 
problem is in PSPACE for randomised multi-strategies and static penalties. The proof for 
dynamic penalties is similar. 

The proof proceeds by constructing a polynomial-size closed formula $\Psi$ of the existential
fragment of $(\mathbb{R}, +, \cdot, \leq)$ such that $\Psi$ is true if and only if there is a multi-strategy ensuring
the required penalty and reward. Because determining the validity of a closed formula of
the existential fragment of $(\mathbb{R}, +, \cdot, \leq)$ is in PSPACE [7], we obtain the desired result.

We do not construct the formula $\Psi$ explicitly, but only sketch the main idea. Recall
that in Section 4.3 we presented a reduction that allows us to approximate the existence
of a multi-strategy using the construction shown in Fig. 4. Note that
if we knew the probabilities with which the required multi-strategy $\theta$ chooses some sets in
a state $s$, we could use these probabilities instead of the numbers $p_1, \ldots, p_m$ in Fig. 4. In
fact, by Theorem 4.4 we would only need $n = 2$, i.e. two numbers per state. Now knowing
these numbers $p^s_1, p^s_2$ for each state, we can construct, in polynomial time, a polynomial-size
instance (disregarding the size of representation of the numbers $p^s_1, p^s_2$) of an MILP problem
such that the optimal solution for the problem is the optimal reward/penalty under the
multi-strategy $\theta$. Of course, we do not know the numbers $p^s_1, p^s_2$ a priori, and so we cannot
use MILP for the solution. Instead, we treat those numbers as variables in the formula $\Psi$
which is built from the constraints in the MILP problem, by adding existential quantification
for $p^s_1, p^s_2$ for all $s \in S$ and all other variables from the MILP problem, requiring $p^s_1$ to be
from $(0,1)$ and $p^s_2 = 1 - p^s_1$, and adding the restriction on the total reward. Note that in the
case of dynamic penalties we need to treat optimal infinite penalty separately, similarly to
the construction for deterministic multi-strategies.

Figure 6: The game for the proof of Theorem 3.13


A.3. Proof of Theorem 3.13 (Square-root-sum Reduction). Let $x_1, \ldots, x_n$ and $y$ be
numbers giving the instance of the square-root-sum problem, i.e. we aim to determine
whether $\sum_{i=1}^{n} \sqrt{x_i} \leq y$. We construct the game from Fig. 6.

The penalties are as given by the underlined numbers, and the rewards $1/x_i$ are awarded
under the actions $b_i$.


Static penalties. We first give the proof for static penalties. We claim that there is a
multi-strategy $\theta$ sound for the property $R^r_{\geq 1}[C]$ such that $\mathit{pen}_{sta}(\psi,\theta) \leq 2 \cdot y$ if and only if
$\sum_{i=1}^{n} \sqrt{x_i} \leq y$.

In the direction $\Leftarrow$ let us define a multi-strategy $\theta$ by $\theta(t'_i)(\{c'_i\}) = \theta(t_i)(\{c_i\}) = \sqrt{x_i}$
and $\theta(t'_i)(\{a'_i, c'_i\}) = \theta(t_i)(\{a_i, c_i\}) = 1 - \sqrt{x_i}$, and allowing all actions in all remaining states.
We then have $\mathit{pen}_{sta}(\psi,\theta) = \sum_{i=1}^{n} 2 \cdot \sqrt{x_i}$ and the reward achieved is:

$$\sum_{i=1}^{n} \frac{1}{n} \min\Big\{ x_i \cdot \frac{1}{x_i},\; \sqrt{x_i} \cdot \sqrt{x_i} \cdot \frac{1}{x_i} \Big\} = 1.$$



In the direction $\Rightarrow$, let $\theta$ be an arbitrary multi-strategy sound for the property $R^r_{\geq 1}[C]$
satisfying $\mathit{pen}_{sta}(\psi,\theta) \leq 2 \cdot y$. Let $z'_i = \theta(t'_i)(\{c'_i\})$ and $z_i = \theta(t_i)(\{c_i\})$. The reward achieved
is:

$$\sum_{i=1}^{n} \frac{1}{n} \min\Big\{ x_i \cdot \frac{1}{x_i},\; z'_i \cdot z_i \cdot \frac{1}{x_i} \Big\}$$

which is greater or equal to 1 if and only if $z'_i \cdot z_i \geq x_i$ for every $i$. We show that
$z'_i + z_i \geq 2 \cdot \sqrt{x_i}$, by analysing the possible cases: If both $z'_i$ and $z_i$ are greater than
$\sqrt{x_i}$, we are done. The case $z'_i, z_i < \sqrt{x_i}$ cannot take place. As for the remaining case,
w.l.o.g., suppose that $z'_i = \sqrt{x_i} + p$ and $z_i = \sqrt{x_i} - q$ for some non-negative $p$ and $q$. Then
$x_i \leq (\sqrt{x_i} + p) \cdot (\sqrt{x_i} - q) = x_i + (p - q)\sqrt{x_i} - pq$, and for this to be at least $x_i$ we necessarily
have $p \geq q$, and so $z'_i + z_i = \sqrt{x_i} + p + \sqrt{x_i} - q \geq 2 \cdot \sqrt{x_i}$.

Hence, we get $\sum_{i=1}^{n} 2 \cdot \sqrt{x_i} \leq \sum_{i=1}^{n} (z'_i + z_i) = \mathit{pen}_{sta}(\psi,\theta) \leq 2 \cdot y$.
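
The case analysis above amounts to the fact that $z'_i \cdot z_i \geq x_i$ forces $z'_i + z_i \geq 2\sqrt{x_i}$. The following sketch searches a rational grid for a counterexample; the grid resolution and the helper name `violates` are illustrative assumptions. Square roots are avoided by comparing squares, which is valid because all quantities are non-negative.

```python
from fractions import Fraction
from itertools import product


def violates(z1, z2, x):
    """True if z1 * z2 >= x holds but z1 + z2 >= 2 * sqrt(x) fails.

    The second condition is tested as (z1 + z2)^2 < 4 * x, which is
    equivalent because z1 + z2 and x are non-negative.
    """
    return z1 * z2 >= x and (z1 + z2) ** 2 < 4 * x


# Probabilities and inputs on a grid of multiples of 1/20 in [0, 1].
grid = [Fraction(k, 20) for k in range(21)]
counterexamples = [(z1, z2, x)
                   for z1, z2, x in product(grid, repeat=3)
                   if violates(z1, z2, x)]

# The choice z1 = z2 = sqrt(x) from the other direction is tight, e.g. for x = 1/4.
tight = (Fraction(1, 2) + Fraction(1, 2)) ** 2 == 4 * Fraction(1, 4)
```

No counterexample exists on the grid, in line with the identity $(z' + z)^2 - 4 z' z = (z' - z)^2 \geq 0$.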

Dynamic penalties. We now proceed with dynamic penalties, where the analysis is similar.
Let us use the same game as before, but in addition assume that the penalty assigned to
actions $c_i$ and $c'_i$ is equal to 1. We claim that there is a multi-strategy $\theta$ sound for the
property $R^r_{\geq 1}[C]$ such that $\mathit{pen}_{dyn}(\psi,\theta) \leq 2 \cdot y/n$ if and only if $\sum_{i=1}^{n} \sqrt{x_i} \leq y$.

In the direction $\Leftarrow$ let us define a multi-strategy $\theta$ as before, and obtain
$\mathit{pen}_{dyn}(\psi,\theta) = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot \sqrt{x_i}$.

In the direction $\Rightarrow$, let $\theta$ be an arbitrary multi-strategy sound for the property $R^r_{\geq 1}[C]$
satisfying $\mathit{pen}_{dyn}(\psi,\theta) \leq 2 \cdot y/n$. Let $z'_i = \theta(t'_i)(\{c'_i\})$, $z_i = \theta(t_i)(\{c_i\})$, $u'_i = \theta(t'_i)(\{a'_i\})$, and
$u_i = \theta(t_i)(\{a_i\})$.

Exactly as before we show that $z'_i + z_i \geq 2 \cdot \sqrt{x_i}$, and so:

$$\frac{1}{n} \sum_{i=1}^{n} 2 \cdot \sqrt{x_i} \leq \frac{1}{n} \sum_{i=1}^{n} (z'_i + z_i) \leq \frac{1}{n} \sum_{i=1}^{n} \big( (z'_i + u'_i) + (1 - u'_i) \cdot (z_i + u_i) \big) = \mathit{pen}_{dyn}(\psi,\theta) \leq 2 \cdot y/n.$$